Current state of advanced PDF conversion on Linux and/or the Web? Thread poster: Artem Vakhitov
From time to time I dabble with Linux (it is currently installed on my desktop PC, which I use for home tasks). As it is now, I could theoretically do a sizeable chunk of my work on Linux using CafeTran or Heartsome Translation Studio, but there is one key thing that I haven't yet figured out how to do on this platform: high-quality PDF conversion. I depend on that heavily. Normally I use ABBYY PDF Transformer 2.0 (the last good version of that software) to convert PDF files. Sometimes I fire up FineReader 11 Pro for that. Besides accurate automatic text conversion, I need the ability to select image, text, and table areas and place them in a specific order. No automatic conversion routine can reliably guess the correct logical sequence and type of items; manual selection is key here. So the question is: does anybody know of any Linux-based or at least Web-based software solutions capable of that? My attempts at googling haven't yielded anything of the sort.

Lehti (Italy; English to Italian)
PDF conversion on Linux is a pain. Tesseract is quite good OCR software, but it has its limits. First, you have to save the pages of your scanned PDF as TIFF files, then fiddle around with GIMP to convert them to black and white with good contrast between the background and the text (basically, Tesseract works best if the background is completely white and the text completely black), and then paste the content of the resulting files into OpenOffice/LibreOffice. It's time-consuming at best. The good news is that you can either:
a) virtualize a Windows machine, if your computer's specs can support one. You just need to open a terminal window and type sudo apt-get install virtualbox
b) install FineReader on your Linux machine using Wine. It seems (but I'm not able to test it) that as of Wine version 1.5.5 the program runs just fine. https://appdb.winehq.org/objectManager.php?sClass=application&iId=1035

Web-based applications and privacy | Jul 10, 2014
Artem Vakhitov wrote: So the question is: does anybody know of any Linux- or at least Web-based software solutions

Please remember: the documents you translate do not belong to you but to your clients. You should therefore never expose them to other people's eyes, and that is exactly what you do when you run web-based applications on them. Never use a web-based application on a client's document unless the client grants you explicit permission to do so.

esperantisto (English to Russian)
FineReader / WFA | Jul 10, 2014
FineReader. Yes, it runs under Linux (via CrossOver, at least FR 7). And Wordfast Anywhere sometimes produces quite good PDF conversion results.
[Edited at 2014-07-10 20:24 GMT]
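
For anyone who wants to script the TIFF-then-Tesseract route that Lehti describes above instead of doing it by hand, here is a minimal sketch in Python. It is not what anyone in this thread actually ran. It assumes that `pdftoppm` (from poppler-utils, used here in place of saving pages manually) and `tesseract` are installed, and the function names and the cutoff of 128 are purely illustrative. The `threshold` function only shows the idea behind the black-and-white step that Lehti does in GIMP.

```python
from pathlib import Path


def threshold(gray_pixels, cutoff=128):
    """Binarize 8-bit grayscale values: darker than cutoff -> 0 (black), else 255 (white).

    Illustrates the "completely white background, completely black text"
    preprocessing; the cutoff of 128 is an assumed starting point.
    """
    return [0 if p < cutoff else 255 for p in gray_pixels]


def rasterize_cmd(pdf, out_prefix, dpi=300):
    """Build a pdftoppm command that writes each PDF page as a grayscale TIFF.

    -r sets the resolution, -gray selects grayscale output,
    -tiff selects TIFF instead of the default PPM.
    """
    return ["pdftoppm", "-r", str(dpi), "-gray", "-tiff", str(pdf), str(out_prefix)]


def ocr_cmd(tiff, lang="eng"):
    """Build a tesseract command: tesseract <image> <output base> -l <lang>.

    Tesseract appends .txt to the output base, so page-1.tif -> page-1.txt.
    """
    tiff = Path(tiff)
    return ["tesseract", str(tiff), str(tiff.with_suffix("")), "-l", lang]
```

Each function returns an argument list that can be passed to `subprocess.run(cmd, check=True)`. The text files that Tesseract writes still have to be pasted into OpenOffice/LibreOffice and cleaned up by hand, so this removes only the repetitive part of the workflow, not the manual zoning asked about in the first post.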

esperantisto (English to Russian)
BTW, do you need to convert "dead" PDFs (i.e. those whose pages are images) or PDFs with text layers? Apache OpenOffice / LibreOffice can import the latter, but the results are not always very good.

Thank you for the replies | Jul 10, 2014
Thank you to all who responded to my query. A lot of useful info here.

I don't want to bother with virtualization, because then I might as well stay on Windows full time. Some versions of FineReader do work under Wine, but not the exact one I have (v.11 Pro). The Wine app compatibility database has a "Garbage" rating for it.

An excellent point was made by RNATranslator regarding confidentiality and web-based apps. However, in cases where the source material is published openly on the Web anyway, this isn't an issue.

Thanks to esperantisto for noting that WFA can sometimes produce good results with PDFs. I guess it uses something like the Solid engine internally. I need some manual control, though.

Another user suggested in a private message that I try Infix PDF Editor for PDF conversion, adding that the developers specifically ensure that it runs under Wine. Even though I feel that it is overall more like the Solid engine mentioned above, I will try it.

Finally, I do need to convert "dead" PDFs from time to time. And I did try LibreOffice for PDFs with a text layer and didn't like the results; I judged that manually copying text and images from Adobe Reader and partly retyping would be faster in that case.

Artem Vakhitov wrote: Another user in a private message suggested that I try Infix PDF Editor for PDF conversion, adding that the developers specifically ensure that it runs under Wine. Even though I feel that it is overall more like the Solid engine mentioned above, I will try it.

As a beta-tester of Infix, I must warn you that it uses one of the slowest OCR engines I've ever seen (I use OmniPage). However, the real value of Infix Pro for translators lies in its use on "distilled", aka software-generated, aka "editable" PDF files.

The concept is that you export the text as tagged XML or TXT (your choice), which also tags your PDF (better to use a copy of it). Then you translate that XML or TXT with your favorite CAT tool, if any. Next, you use Infix Pro to import the translated XML or TXT in place. The tags ensure the right fonts, sizes, colors, positions, etc. Finally, Infix offers you all the DTP tools to fix the layout of the PDF after translation.

Of course, this is an oversimplification of the entire process. You may see a somewhat more detailed walkthrough from an actual project I did at http://www.lamensdorf.com.br/translating-a-pdf.html .

The big advantage is that you no longer need to have (and know how to use) InDesign, PageMaker, FrameMaker, and QuarkXPress to serve all clients in your language pair, as long as they can provide you with a PDF. You'll also cover the amateurish MS Publisher, Serif PagePlus and Scribus. Furthermore, those MS Word files with pesky layouts are easier to translate as PDFs using Infix, because Word lacks most DTP tools.

The good news is that Infix costs only a fraction of the price of one professional DTP program.