[Solved] Search for PDF files with content:

Discussion related to "Everything" 1.5 Alpha.
w64bit
Posts: 161
Joined: Wed Jan 09, 2013 9:06 am

[Solved] Search for PDF files with content:

Post by w64bit » Wed Jan 19, 2022 10:40 am

I tried to search for PDF files that contain specific text in their pages.
I installed Adobe PDF iFilter 64 v11.0.01.
With Everything 1.5.0.1295a x64 I get "Querying ... 0 objects" even though the text is present in the PDF.
Last edited by w64bit on Fri Jan 21, 2022 3:28 pm, edited 2 times in total.

NotNull
Posts: 3648
Joined: Wed May 24, 2017 9:22 pm

Re: Search for PDF files with content:

Post by NotNull » Wed Jan 19, 2022 11:18 am

1. What search query did you use?
2. What is defined under Menu:Tools > Options > Indexes > Content?

w64bit
Posts: 161
Joined: Wed Jan 09, 2013 9:06 am

Re: Search for PDF files with content:

Post by w64bit » Wed Jan 19, 2022 11:28 am

1. D: *.pdf content:text
2. Index file content is not enabled.
I am trying to avoid indexing file content and instead search inside the files on demand, even if it takes longer. This works with DWG files, as I have a DWG iFilter installed with AutoCAD.

void
David Carpenter (Developer)
Posts: 9378
Joined: Fri Oct 16, 2009 11:31 pm

Re: Search for PDF files with content:

Post by void » Wed Jan 19, 2022 11:53 am

What OS are you using?
I would like to test this on my end.

The Adobe PDF iFilter might not like the COM multithreaded concurrency model. Please try disabling content_pdf_ifilter_coinit_multithreaded:
  • In Everything, type in the following search and press ENTER:
    /content_pdf_ifilter_coinit_multithreaded=0
  • If successful, content_pdf_ifilter_coinit_multithreaded=0 is shown in the status bar for a few seconds.
  • Please restart Everything, type in the following search and press ENTER:
    /restart-now

Everything might be getting stuck on a specific file.
Please try running Everything in verbose debug mode:
  • In Everything, from the Tools menu, under the Debug submenu, check Console.
  • From the Tools menu, under the Debug submenu, check Verbose.
  • Perform your pdf content search.
  • What is shown in the Debug console when Everything gets stuck showing Querying...?

w64bit
Posts: 161
Joined: Wed Jan 09, 2013 9:06 am

Re: Search for PDF files with content:

Post by w64bit » Wed Jan 19, 2022 12:49 pm

Win 10 21H2 x64

/content_pdf_ifilter_coinit_multithreaded=0 did not help. Same Querying ... 0 objects.

Debug for content_pdf_ifilter_coinit_multithreaded=1
Last edited by void on Thu Jan 20, 2022 9:54 am, edited 1 time in total.
Reason: remove logs

NotNull
Posts: 3648
Joined: Wed May 24, 2017 9:22 pm

Re: Search for PDF files with content:

Post by NotNull » Wed Jan 19, 2022 3:50 pm

EDIT:
I mixed up a couple of things. Removed my answer to avoid sending other people reading this in the wrong direction ...

void
David Carpenter (Developer)
Posts: 9378
Joined: Fri Oct 16, 2009 11:31 pm

Re: Search for PDF files with content:

Post by void » Thu Jan 20, 2022 9:54 am

Thank you for the debug logs w64bit,

From memory, w64bit runs as the true admin.


The debug log shows: failed to load stream
It looks like the query eventually completes (after ~60 seconds).
The issue is that Everything is not finding content in PDF files at all.

The iFilter fails to load my file stream.
I'm unsure of the reason...



Everything 1.5.0.1296a adds more debug information.

Could you please try running this Everything version in verbose debug mode again:
  • In Everything, from the Tools menu, under the Debug submenu, check Console.
  • From the Tools menu, under the Debug submenu, check Verbose.
  • Perform your pdf content search.
  • What is shown in the Debug console after the query completes?
This version should log:
failed to load stream <filename> <failure-reason>
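
For anyone who wants to collect these lines from a saved copy of the debug console output, here is a minimal Python sketch. It assumes each failure line has the shape shown above (the text "failed to load stream", then the full path, then an 8-digit hex code at the end of the line). The function name and the sample path are made up for illustration, and a real log will contain other text around these lines.

```python
import re

# Assumed line shape: "failed to load stream <filename> <failure-reason>",
# where the failure reason is an 8-digit hex code at the end of the line.
FAILED_STREAM = re.compile(r"failed to load stream (.+) ([0-9A-Fa-f]{8})\s*$")

def failed_streams(log_lines):
    """Return (filename, failure_code) pairs for every matching log line."""
    results = []
    for line in log_lines:
        match = FAILED_STREAM.search(line)
        if match:
            results.append((match.group(1), match.group(2).lower()))
    return results

# Hypothetical sample lines, not taken from a real log.
sample = [
    "some unrelated debug output",
    r"failed to load stream D:\docs\my report.pdf 80004005",
]
for name, code in failed_streams(sample):
    print(code, name)
```

Because the pattern anchors the hex code to the end of the line, paths that contain spaces are still captured whole.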

w64bit
Posts: 161
Joined: Wed Jan 09, 2013 9:06 am

Re: Search for PDF files with content:

Post by w64bit » Thu Jan 20, 2022 12:01 pm

With:
Adobe PDF iFilter 64 v11.0.01
content_ifilter=1
content_pdf_ifilter_coinit_multithreaded=0
content_ifilter_coinit_multithreaded=0
=> failure-reason 80004005

If I uninstall the Adobe PDF iFilter and use the Windows default iFilter, searching does not get stuck but fails to find all matching PDF files; some of them are missing from the result list.
It seems that with the Windows default iFilter, Everything finds PDF files when I type the first 3 letters of a word, but finds nothing when I type 4 or more letters.

void
David Carpenter (Developer)
Posts: 9378
Joined: Fri Oct 16, 2009 11:31 pm

Re: Search for PDF files with content:

Post by void » Fri Jan 21, 2022 12:37 am

Thank you for the debug info w64bit,

80004005 is a generic error code.
It's a common issue with the Adobe PDF iFilter.
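
To make "generic" concrete, here is a small Python sketch that splits an HRESULT into its standard fields (severity bit, 11-bit facility, 16-bit code). 80004005 is E_FAIL ("unspecified failure") in facility 0, so the number on its own says nothing about why the iFilter failed. The function name is only illustrative.

```python
def decode_hresult(hex_text):
    """Split an HRESULT given as hex text into (failed, facility, code)."""
    value = int(hex_text, 16)
    failed = bool((value >> 31) & 1)    # bit 31: severity (1 = failure)
    facility = (value >> 16) & 0x7FF    # bits 16-26: facility
    code = value & 0xFFFF               # bits 0-15: error code
    return failed, facility, code

failed, facility, code = decode_hresult("80004005")
print(failed, facility, hex(code))  # True 0 0x4005
```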

Everything is already using IPersistStream.

I am testing this on my end and will get back to you.


Could you please send me a PDF file where a 4 letter word doesn't match (with the default iFilter) to support@voidtools.com?
It's most likely a bad break injecting a space or newline.

void
David Carpenter (Developer)
Posts: 9378
Joined: Fri Oct 16, 2009 11:31 pm

Re: Search for PDF files with content:

Post by void » Fri Jan 21, 2022 6:08 am

Everything 1.5.0.1297a fixes an issue with the Adobe PDF iFilter not loading correctly.

The Adobe PDF iFilter is loading a dll dependency from the current directory.
Everything 1.5 previously prevented this type of dll loading.

w64bit
Posts: 161
Joined: Wed Jan 09, 2013 9:06 am

Re: Search for PDF files with content:

Post by w64bit » Fri Jan 21, 2022 7:46 am

This file has a problem with the first "e".
Attachments
text - Pieces.zip
(58.33 KiB) Downloaded 64 times

void
David Carpenter (Developer)
Posts: 9378
Joined: Fri Oct 16, 2009 11:31 pm

Re: Search for PDF files with content:

Post by void » Fri Jan 21, 2022 10:40 am

Thank you for the test1.pdf sample.

This appears to work fine for me on Windows 10 21H1 with the stock PDF iFilter.

Do you have any Search options enabled under the Search menu?



Could you please send the verbose debug output when searching content in this file?
  • In Everything, from the Tools menu, under the Debug submenu, check Console.
  • From the Tools menu, under the Debug submenu, check Verbose.
  • Search for:
    Test1.pdf content:test
  • What is shown in the Debug console after the query completes?

void
David Carpenter (Developer)
Posts: 9378
Joined: Fri Oct 16, 2009 11:31 pm

Re: Search for PDF files with content:

Post by void » Fri Jan 21, 2022 10:42 am

Another trick with Everything 1.5 that might be helpful here:
  • In Everything, search for:
    Test1.pdf dotall:regex:content:(.*)
  • Show the Regular Expression Match 1 column.
  • What is shown for you in this column?

w64bit
Posts: 161
Joined: Wed Jan 09, 2013 9:06 am

Re: Search for PDF files with content:

Post by w64bit » Fri Jan 21, 2022 11:41 am

No Search options enabled under the Search menu.
Test1.pdf dotall:regex:content:(.*) + Regular Expression Match 1 => nothing found in search list
debug.txt attached

It seems to be related to my fresh install of Win 10 21H2 x64 from a clean ISO, dated Dec 2021.
I don't remember this PDF problem on my previous install of Win 10 21H2, which was upgraded from 20H2 x64 with all updates installed through Windows Update.
Last edited by void on Fri Jan 21, 2022 11:46 am, edited 1 time in total.
Reason: removed debug logs

void
David Carpenter (Developer)
Posts: 9378
Joined: Fri Oct 16, 2009 11:31 pm

Re: Search for PDF files with content:

Post by void » Fri Jan 21, 2022 11:46 am

LoadIFilter Test1.pdf 80004005
Thanks for the debug logs.
The PDF iFilter straight up fails to load. (Generic Failure error code)


The stock PDF iFilter does not like running in an STA thread.

Please try re-enabling content_pdf_ifilter_coinit_multithreaded:
  • In Everything, type in the following search and press ENTER:
    /content_pdf_ifilter_coinit_multithreaded=1
  • If successful, content_pdf_ifilter_coinit_multithreaded=1 is shown in the status bar for a few seconds.
  • Please restart Everything, type in the following search and press ENTER:
    /restart-now
Does the issue persist?

w64bit
Posts: 161
Joined: Wed Jan 09, 2013 9:06 am

Re: Search for PDF files with content:

Post by w64bit » Fri Jan 21, 2022 11:59 am

I checked, added, and corrected the PDF PersistentHandler registry entries, and now everything works with the Windows 10 default iFilter.
Thank you very much.

void
David Carpenter (Developer)
Posts: 9378
Joined: Fri Oct 16, 2009 11:31 pm

Re: [Solved] Search for PDF files with content:

Post by void » Sun Jan 23, 2022 3:13 am

Thanks for the update w64bit,

I am glad to hear PDF content searching is working now.

defza
Posts: 22
Joined: Thu Apr 18, 2019 12:49 pm

Re: [Solved] Search for PDF files with content:

Post by defza » Sat Apr 16, 2022 3:45 pm

Hi,

I would like to show the files that have the result of "failed to load stream".
Is this possible?
In my case, those are the ones I'm looking for: basically dirty, bad, or textless PDFs. They usually share this trait of failing to load.

I saw error 8004807a in the logs in my case.

void
David Carpenter (Developer)
Posts: 9378
Joined: Fri Oct 16, 2009 11:31 pm

Re: [Solved] Search for PDF files with content:

Post by void » Mon Apr 18, 2022 3:06 am

Everything doesn't have a search function that will find PDF files that fail to load.

Everything 1.5 might help to find bad PDF files with the new PDF properties:
  • In Everything 1.5, right click the result list column header and click Add columns....
  • Click application/pdf on the left.
  • Select all properties and click OK.
  • Examine these new columns in Everything for your PDF files. (eg: search for *.pdf )
    A missing File Signature will definitely indicate a bad PDF file.
I will consider a 'has preview' property.
Thank you for the suggestion.

defza
Posts: 22
Joined: Thu Apr 18, 2019 12:49 pm

Re: [Solved] Search for PDF files with content:

Post by defza » Mon Apr 18, 2022 9:27 am

Thanks, but in this case, the file shows application/pdf as File Signature and content-type just fine.

But in the log it says

 failed to load stream F:\Seminary\TheologyBible\theolibrary.shc.edu\www.duq.edu\documents\theology\_pdf\faculty-publications\Bulletin_of_Ecumenical_Theology_21_2009.pdf 8004807a
It seems that the iFilter (content search) fails, and Everything then falls back to searching the raw file contents as literal text, because my regex search returns the actual raw file content. By contrast, a PDF file with actual body text (in this case, one with a name ending in -text.pdf) returns body text rather than raw content. See the screenshot:
Image

Below is the source file; I'm trying to detect that it has no text content. My usual search of regex:content:\A\s*\z or regex:content:^$ doesn't work in this case, because it seems to return the raw contents of the file.

Update:
Oh, I think this would work: if I search for regex:content:^\%PDF or content:%PDF, it finds this file, which fails to load as an actual PDF and is instead searched as binary/raw bytes.
Update2:
But this doesn't really solve the problem, because it still forces a full content search of every "valid" PDF just to evaluate the regex I'm looking for.
I guess what I want is not to search the whole file contents, but to search only the start of each PDF, whether it has an issue or has full text, so that the content search fails faster on big files with no match and moves on to the next file.


Here's the source pdf file: https://www.duq.edu/Documents/theology/ ... 1_2009.pdf
Last edited by defza on Mon Apr 18, 2022 1:58 pm, edited 1 time in total.

void
David Carpenter (Developer)
Posts: 9378
Joined: Fri Oct 16, 2009 11:31 pm

Re: [Solved] Search for PDF files with content:

Post by void » Mon Apr 18, 2022 10:37 am

Error 3 is path not found.

Is the U: drive online?
Does Everything have access to your U: drive? (are you running Everything as an administrator?)

Does forcing a rebuild from Tools -> Options -> Indexes -> Force Rebuild help?

horst.epp
Posts: 723
Joined: Fri Apr 04, 2014 3:24 pm

Re: [Solved] Search for PDF files with content:

Post by horst.epp » Mon Apr 18, 2022 12:08 pm

Your example pdf file can't be indexed as it contains no searchable text at all.
Try to select some text from it in your PDF tool and you will see.
If I run my PDF-XChange OCR tool on it, the file becomes fully searchable and can be indexed with Everything.
Its size also shrinks from 9.1 MB to 1.1 MB this way.
Btw, the original has some structural errors, which can be fixed.

defza
Posts: 22
Joined: Thu Apr 18, 2019 12:49 pm

Re: [Solved] Search for PDF files with content:

Post by defza » Mon Apr 18, 2022 2:00 pm

void wrote:
Mon Apr 18, 2022 10:37 am
Error 3 is path not found.

Is the U: drive online?
Does Everything have access to your U: drive? (are you running Everything as an administrator?)

Does forcing a rebuild from Tools -> Options -> Indexes -> Force Rebuild.
Sorry, that was my mistake; ignore that part, I've edited it out now. It was a copy of the file on a non-existent disk.
:oops:
Your example pdf file can't be indexed as it contains no searchable text at all.
Yes, I know; that's exactly the type of file I'm trying to find with an Everything search.

defza
Posts: 22
Joined: Thu Apr 18, 2019 12:49 pm

Re: [Solved] Search for PDF files with content:

Post by defza » Mon Apr 18, 2022 3:17 pm

void wrote:
Mon Apr 18, 2022 10:37 am
Error 3 is path not found.

Is the U: drive online?
Does Everything have access to your U: drive? (are you running Everything as an administrator?)

Does forcing a rebuild from Tools -> Options -> Indexes -> Force Rebuild.
Actually, I've got a global filter that shows only online: files... strange that it was trying to access the file on that drive.
It seems that content searches may be ignoring the online attribute?

horst.epp
Posts: 723
Joined: Fri Apr 04, 2014 3:24 pm

Re: [Solved] Search for PDF files with content:

Post by horst.epp » Mon Apr 18, 2022 5:42 pm

I use a modified script from NotNull to create a list of files which need OCR.
It uses pdftotext tool and creates a file need_ocr.txt in the dir with the PDFs.
Currently it adds an unwanted space on the end of every name in the list.

@echo off
setlocal
rem echo on
pushd "%~dp0"
cls
::____________________________________________________________
::
::				SETTINGS
::____________________________________________________________
::
	chcp 1252
	set OUT-List=.\need_ocr.txt
	del %OUT-LIST%

::____________________________________________________________
::
::				ACTION!
::____________________________________________________________
::

	for %%X in (*.pdf) do (
		echo.    [%%X]
		C:\Tools\xpdf-tools\pdftotext.exe -simple "%%X" .\checkthis.txt
		for %%C in (checkthis.txt) DO if %%~zC LSS 25 ( echo %~dp0%%X>>"%OUT-List%" )
		del checkthis.txt
	)
pause
goto :EOF
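
For comparison, here is a rough Python sketch of the same idea: extract the text from each PDF and list the files whose extracted text is shorter than 25 bytes. The pdftotext path, the -simple option, and the 25-byte threshold are copied from the batch script above. The function names are hypothetical, and the extraction step is passed in as a parameter so the listing logic can be tried without pdftotext installed. Writing one path per line with no padding also sidesteps the trailing space.

```python
import subprocess
import tempfile
from pathlib import Path

PDFTOTEXT = r"C:\Tools\xpdf-tools\pdftotext.exe"  # path taken from the batch script

def extract_with_pdftotext(pdf_path):
    """Run pdftotext on one PDF and return the extracted text as bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        out_file = Path(tmp) / "checkthis.txt"
        subprocess.run(
            [PDFTOTEXT, "-simple", str(pdf_path), str(out_file)], check=False
        )
        return out_file.read_bytes() if out_file.exists() else b""

def list_pdfs_needing_ocr(folder, extract=extract_with_pdftotext, threshold=25):
    """Return PDFs in folder whose extracted text is under threshold bytes."""
    return [
        pdf for pdf in sorted(Path(folder).glob("*.pdf"))
        if len(extract(pdf)) < threshold
    ]

def write_list(folder, out_name="need_ocr.txt", **kwargs):
    """Write one full path per line, with no trailing space."""
    pdfs = list_pdfs_needing_ocr(folder, **kwargs)
    text = "".join(f"{pdf.resolve()}\n" for pdf in pdfs)
    Path(folder, out_name).write_text(text, encoding="utf-8")
    return pdfs
```

Calling write_list on a folder of PDFs would create need_ocr.txt in that folder, much like the batch script does for its own directory.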

void
David Carpenter (Developer)
Posts: 9378
Joined: Fri Oct 16, 2009 11:31 pm

Re: [Solved] Search for PDF files with content:

Post by void » Tue Apr 19, 2022 7:11 am

Actually, I've got a global filter on showing results only from the available online: only files... strange that it was trying to access the file on that drive?
Seems that maybe content searches are ignoring the online attribute?
online: also matches files where the online status is unknown.
I will change online: in the next alpha update to match only files that are known to be online.

defza
Posts: 22
Joined: Thu Apr 18, 2019 12:49 pm

Re: [Solved] Search for PDF files with content:

Post by defza » Tue Apr 19, 2022 5:04 pm

horst.epp wrote:
Mon Apr 18, 2022 5:42 pm
I use a modified script from NotNull to create a list of files which need OCR.
It uses pdftotext tool and creates a file need_ocr.txt in the dir with the PDFs.
Currently it adds an unwanted space on the end of every name in the list.

I believe Everything itself can find all PDFs that are just images or are corrupt with these two search queries:
1. regex:content:\A\s*\z (finds the PDFs with no content, i.e. only whitespace is returned) or regex:content:^$
2. regex:content:^\%PDF (This works by trying to search the content; when that fails, Everything reverts to a raw binary search and then finds the PDF header. This only happens with files for which the iFilter/system PDF search returns an error.)
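
To show what the two patterns test, here is a small Python sketch applied to plain strings. It only illustrates the regex logic: it does not call Everything, and the fallback to raw file content is the behaviour described above, not something this code verifies. Python's re module writes the end-of-text anchor as \Z where the query uses \z, and the function names are made up for this example.

```python
import re

# Query 1: content that is empty or whitespace only.
# Python writes the end-of-text anchor as \Z (the Everything query uses \z).
BLANK = re.compile(r"\A\s*\Z")

# Query 2: content that starts with the raw PDF header.
PDF_HEADER = re.compile(r"^%PDF")

def is_textless(content):
    """True when the extracted content is empty or only whitespace."""
    return BLANK.search(content) is not None

def looks_like_raw_pdf(content):
    """True when the content begins with the %PDF file header."""
    return PDF_HEADER.search(content) is not None

print(is_textless(" \n\t"))                      # True
print(is_textless("Bulletin text"))              # False
print(looks_like_raw_pdf("%PDF-1.4\n1 0 obj"))   # True
```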
