Current state of advanced PDF conversion on Linux and/or the Web?
Thread poster: Artem Vakhitov
Artem Vakhitov  Identity Verified
Kyrgyzstan
English to Russian
Jul 2, 2014

I dabble with Linux from time to time (it is currently installed on the desktop PC I use for home tasks). As things stand, I could theoretically do a sizeable chunk of my work on Linux using CafeTran or Heartsome Translation Studio, but there is one key thing I haven't yet figured out how to do on this platform: high-quality PDF conversion. I depend on it heavily. Normally I use ABBYY PDF Transformer 2.0 (the last good version of that software) to convert PDF files. Sometimes I fire up FineReader 11 Pro instead. Besides accurate automatic text conversion, I need the ability to select image, text, and table areas and place them in a specific order. No automatic conversion routine can reliably guess the correct logical sequence and type of items; manual selection is key here.

So the question is: does anybody know of any Linux-based, or at least Web-based, software capable of that? My attempts at googling haven't turned up anything of the sort.


 
Lehti
Italy
Local time: 12:37
English to Italian
kind of Jul 4, 2014

PDF conversion on Linux is a pain. Tesseract is quite a good OCR engine, but it has its limits. First you have to save the pages of your scanned PDF as TIFF files, then fiddle around in GIMP to convert them to black and white with good contrast between background and text (basically, Tesseract works best if the background is completely white and the text completely black), and finally paste the contents of the resulting files into OpenOffice/LibreOffice. It's time-consuming at best.
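The manual routine above can be partly scripted. What follows is only a rough sketch, assuming Poppler's pdftoppm and the tesseract command-line program are installed; the function names, the "page" file prefix and the 300 dpi default are illustrative choices, not anything from this thread. It also swaps the GIMP black-and-white step for plain grayscale rendering and leaves the thresholding to Tesseract itself, so difficult scans may still need manual clean-up.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    """Build a pdftoppm (Poppler) command: one grayscale TIFF per page."""
    return ["pdftoppm", "-r", str(dpi), "-gray", "-tiff", str(pdf), str(prefix)]


def ocr_cmd(image, lang="eng"):
    """Build a tesseract command that writes <image stem>.txt next to the image."""
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def page_number(image):
    """Extract the page number from a name such as page-07.tif."""
    return int(Path(image).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf, workdir, lang="eng"):
    """Rasterize every page of a scanned PDF, OCR each page, return the joined text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_cmd(pdf, workdir / "page"), check=True)
    texts = []
    # Sort numerically so the pages stay in document order.
    for image in sorted(workdir.glob("page-*.tif"), key=page_number):
        subprocess.run(ocr_cmd(image, lang), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

A call such as ocr_pdf("scan.pdf", "ocr_work", lang="ita") would return plain text only. It does nothing about page layout, tables or reading order, which is exactly the part that still needs manual work.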

The good news is that you can either:
a) virtualize a Windows machine, if your computer's specs can support one. You just need to open a terminal window and type
sudo apt-get install virtualbox
b) install FineReader on your Linux machine using Wine. It seems (though I can't test it myself) that as of Wine 1.5.5 your program runs just fine.
https://appdb.winehq.org/objectManager.php?sClass=application&iId=1035


 
RNAtranslator  Identity Verified
Local time: 12:37
English to Spanish
Web-based applications and privacy Jul 10, 2014

Artem Vakhitov wrote:

So the question is: does anybody know of any Linux- or at least Web-based software solutions


Please remember: the documents you translate do not belong to you but to your clients. You should therefore never expose them to other people's eyes, which is exactly what you do when you run them through web-based applications. Never use a web-based application on a client's document unless the client has given you explicit permission to do so.


 
esperantisto  Identity Verified
Local time: 13:37
Member (2006)
English to Russian
SITE LOCALIZER
FineReader / WFA Jul 10, 2014

FineReader. Yes, it runs under Linux (via Crossover, at least FR 7).

And Wordfast Anywhere sometimes produces quite good PDF conversion results.

[Edited at 2014-07-10 20:24 GMT]


 
esperantisto  Identity Verified
Local time: 13:37
Member (2006)
English to Russian
SITE LOCALIZER
Dead PDFs? Jul 10, 2014

BTW, do you need to convert ‘dead’ PDFs (i.e. with pages that are just images) or PDFs with text layers? Apache OpenOffice / LibreOffice can import the latter, but the results are not always very good.
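One rough way to tell the two kinds apart automatically is to try extracting text and see whether anything comes out. The sketch below assumes Poppler's pdftotext is installed; the function names and the 20-character threshold are arbitrary illustrative choices, and a scanned PDF carrying a hidden OCR layer will be reported as having text.

```python
import subprocess


def has_text_layer(extracted, min_chars=20):
    """Heuristic: enough non-whitespace characters means a usable text layer."""
    return sum(not ch.isspace() for ch in extracted) >= min_chars


def extract_text(pdf):
    """Run pdftotext (Poppler); the '-' argument sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With these, has_text_layer(extract_text("file.pdf")) returning False suggests a ‘dead’ PDF that needs OCR.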

 
Artem Vakhitov  Identity Verified
Kyrgyzstan
English to Russian
TOPIC STARTER
Thank you for the replies Jul 10, 2014

Thank you to all who responded to my query. A lot of useful info here.

I don't want to bother with virtualization, because then I might as well stay on Windows full time.

Some versions of FineReader do work under Wine, but not the exact one I have (v.11 Pro). The Wine app compatibility database has a "Garbage" rating for it.

RNAtranslator made an excellent point regarding confidentiality and web-based apps. However, in cases where the source material is published openly on the Web anyway, this isn't an issue.

Thanks to esperantisto for noting that WFA can sometimes produce good results with PDFs. I guess it uses something like the Solid engine internally. I need some manual control, though.

Another user in a private message suggested that I try Infix PDF Editor for PDF conversion, adding that the developers specifically ensure that it runs under Wine. Even though I feel that it is overall more like the Solid engine mentioned above, I will try it.

Finally, I do need to convert "dead" PDFs from time to time. I also tried LibreOffice on PDFs with a text layer and didn't like the results; I concluded that manually copying text and images from Adobe Reader and partly retyping would be faster in that case.


 
José Henrique Lamensdorf  Identity Verified
Brazil
Local time: 07:37
English to Portuguese
In memoriam
On Infix Jul 11, 2014

Artem Vakhitov wrote:

Another user in a private message suggested that I try Infix PDF Editor for PDF conversion, adding that the developers specifically ensure that it runs under Wine. Even though I feel that it is overall more like the Solid engine mentioned above, I will try it.


As a beta-tester of Infix, I must warn you that it uses one of the slowest OCR engines I've ever seen (I use OmniPage).

However, the real value of Infix Pro for translators lies in its use on "distilled", aka software-generated, aka "editable" PDF files.

The concept is that you export the text as tagged XML or TXT (your choice) and tag your PDF as well (better to work on a copy). You then translate that XML or TXT with your favorite CAT tool, if any. Next, you use Infix Pro to import the translated XML or TXT in place. The tags ensure the right fonts, sizes, colors, positions, etc. Finally, Infix offers all the DTP tools you need to fix the layout of the PDF after translation.

Of course, this is an oversimplification of the entire process. You may see a somewhat more detailed walkthrough from an actual project I did at http://www.lamensdorf.com.br/translating-a-pdf.html .

The big advantage is that you no longer need to own (and know how to use) InDesign, PageMaker, FrameMaker, and QuarkXPress to serve all clients in your language pair, as long as they can provide you with a PDF. You'll also cover the amateurish MS Publisher, Serif PagePlus and Scribus. Furthermore, MS Word files with pesky layouts are easier to translate as PDFs using Infix, because Word lacks most DTP tools. The good news is that Infix costs only a fraction of the price of one professional DTP program.


 

