Headers and footers in OmegaT when working with a Word document that was converted from a PDF (OmegaT support)


Thread poster: Nina Halperin

Nina Halperin

Peru
Local time: 03:27
Spanish to English

Jul 21, 2020

Hello,

I am doing a practice project in OmegaT 4.3.2, translating a document that I converted from a PDF into a Word document using Wordfast Anywhere. The formatting in the resulting Word document was preserved almost entirely; however, I did not notice at first that the page numbers in the bottom right corner of every page are treated as part of the body of the document rather than as footers. Consequently, the number "1" became a separate segment that cut right through the middle of a sentence. That sentence was therefore split into two segments, divided by a segment consisting only of the number "1".

In OmegaT, when translating Word documents that have been converted from PDFs, do I need to manually put all the text that should be in headers or footers into the header/footer sections of the resulting Word document before uploading the document into OmegaT? Because I did not realize the discrepancy in time while working on this project, I ended up editing the source document within OmegaT and taking out the page numbers altogether, with the intention of putting them in manually in the final document after extracting it from OmegaT. The sentence I mentioned that had been divided by the number "1" was still divided, and when I tried to create a "Quick-fix Rule" under the segmentation section to join the two segments, it did not work. The segments in question are:
"George and his siblings spent a few months in foster care before his grandmother and"
"her husband [t0/]were[t1/] able to take custody of them."
I had added a quick-fix rule of:
Pattern before: "and"
Pattern after: " her" (with a space before "her")
"Break/exception" was not checked. Am I doing something wrong?

I then tried editing the source document within OmegaT and manually putting the page numbers within the footer. However, that seemed to affect the structure of the rest of the document: there are now various sentences that are divided into multiple segments in OmegaT, when they should be just one segment each. Is the best course of action in situations like this to manually fix the headers and footers in the converted Word document before uploading it into OmegaT? Thank you!

[Edited at 2020-07-21 21:00 GMT]

esperantisto

Local time: 11:27
Member (2006)
English to Russian

SITE LOCALIZER

Yes, you are

Jul 22, 2020

Nina Halperin wrote:

Am I doing something wrong?

Yes. The wrong part is that you do not provide the actual file. Show it.

Susan Welsh

United States
Local time: 04:27
Russian to English

OCR quality?

Jul 22, 2020

See https://www.proz.com/forum/omegat_support/345230-trouble_with_charts_in_omegat.html#2860256

I also suggest that you ask questions on the developer list, which has more traffic than here:
https://sourceforge.net/projects/omegat/lists/omegat-development

Jean Dimitriadis

English to French

Garbage in, garbage out

Jul 22, 2020

This sounds like a problem of source document preparation.

Instead of quick fix rules, I would make sure the source Word file is formatted appropriately (including line/paragraph breaks, headers and footers, etc.). Since you have already worked on the translation, you can reimport the document to the existing project.

Samuel Murray

Netherlands
Local time: 10:27
Member (2006)
English to Afrikaans

@Nina

Jul 22, 2020

Nina Halperin wrote:
The segments in question are:
"George and his siblings spent a few months in foster care before his grandmother and"
"her husband [t0/]were[t1/] able to take custody of them."

I don't think this is a header/footer problem, but a problem in that you have a line break between "and" and "her". OmegaT's segmentation settings can't fix an errant line break. You need to make sure that you have no unnecessary line breaks in your source file.
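One way to picture why a rule cannot help here, as a minimal Python sketch and not OmegaT's actual code: assume the file filter hands over one paragraph at a time and the sentence rules only ever run inside a single paragraph. The function name `segment` and the single break rule are made up for illustration.

```python
import re

# Hypothetical, simplified break rule: split after ., ! or ? followed by whitespace.
BREAK_RULE = re.compile(r"(?<=[.!?])\s+")

def segment(paragraphs):
    """Split each paragraph into sentences; never look across paragraphs."""
    segments = []
    for para in paragraphs:
        # Rules are applied to one paragraph at a time, so no rule ever
        # sees the text of the neighbouring paragraph.
        segments.extend(s for s in BREAK_RULE.split(para.strip()) if s)
    return segments

# The stray paragraph breaks around the page number give three paragraphs.
broken = [
    "George and his siblings spent a few months in foster care before his grandmother and",
    "1",
    "her husband were able to take custody of them.",
]
# The same sentence once the source file has been repaired.
fixed = [broken[0] + " " + broken[2]]

print(len(segment(broken)), len(segment(fixed)))
```

In this model the "and" / " her" exception never fires, because "and" is the end of one paragraph and "her" is the start of another; only repairing the source file brings them into the same paragraph.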

Nina Halperin wrote:
I did not notice at first that the page numbers in the bottom right corner of every page are considered part of the body of the document instead of as footers. Consequently, the number "1" was considered as a separate segment that cut right in the middle of a sentence. That sentence was therefore split into two segments, divided by a segment consisting only of the number "1."

Yes, that is what you can expect to happen if there is a line break, a number, and another line break, in the middle of a sentence. If you fix the document before loading it into OmegaT, by e.g. creating real headers and footers, and making sure that there are no line breaks in the middle of the sentence that used to be at the end and start of the two pages, then OmegaT will handle it correctly.

A PDF converter does 99% of the work for you, but you still have to fix the document before loading it into the CAT tool (and this applies to all CAT tools that I know of).
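The kind of clean-up meant here can be sketched in Python on a plain list of paragraph strings. This is only an illustration under stated assumptions: the name `clean_paragraphs` and both patterns are invented for the example, a paragraph containing nothing but digits is assumed to be a stray page number, and a paragraph that does not end in punctuation is assumed to be an interrupted sentence.

```python
import re

PAGE_NUMBER = re.compile(r"^\d{1,4}$")     # a paragraph that is only a page number
SENTENCE_END = re.compile(r"[.!?:;)\]]$")  # crude test for a finished paragraph

def clean_paragraphs(paragraphs):
    """Drop bare page numbers and rejoin sentences that were split in two."""
    cleaned = []
    for para in paragraphs:
        text = para.strip()
        if not text or PAGE_NUMBER.match(text):
            continue  # skip empty paragraphs and stray page numbers
        if cleaned and not SENTENCE_END.search(cleaned[-1]):
            # Previous paragraph stopped mid-sentence: glue this one onto it.
            cleaned[-1] = cleaned[-1] + " " + text
        else:
            cleaned.append(text)
    return cleaned

pages = [
    "George and his siblings spent a few months in foster care before his grandmother and",
    "1",
    "her husband were able to take custody of them.",
]
print(clean_paragraphs(pages))
```

Headings and table cells usually lack final punctuation too, so a heuristic like this would glue them to the following text; the result still needs checking by eye.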

[Edited at 2020-07-22 18:25 GMT]

Nina Halperin

Peru
Local time: 03:27
Spanish to English

TOPIC STARTER

Trouble getting rid of line breaks at the bottom of the page when there are section breaks

Jul 23, 2020

Thank you everyone for your replies! Esperantisto, I must admit I am unsure what you are referring to. Susan, I did see your reply on my other post, thank you. But as I do not have ABBYY FineReader, I'm unsure if I'm able to carry out your suggestions. For example, what do you mean by "Mark footnotes...on the hard copy"? Do you mean that I have to manually insert the footnotes in the source document as Jean and Samuel have suggested?

Samuel, like I just mentioned in my other post that you replied to, I enabled the paragraph sign button in the source document within OmegaT and noticed that there was a paragraph sign at the bottom of every page, probably due to the presence of section breaks. I wasn't sure if I could delete the paragraph signs while leaving the section breaks, but I tried. In some cases, it did allow me to delete the paragraph sign, but when I saved the file and reloaded the project in OmegaT, the sentence in question was still divided in two segments within OmegaT, which made it seem like the line break had actually remained somehow. In other cases, when I deleted the paragraph sign at the end of the page, that section of text got grayed out for some reason. I ended up undoing that action as a result.

Then I opened the original translated document that I had completed some time ago without any CAT tools (I was just trying to translate sections of it again on OmegaT for practice). There were no section breaks in my original translated text. I suppose that Wordfast added them in for some unknown reason in the conversion process. In this particular case, I suppose I could just take out the section breaks. However, I would still like to know how to deal with removing line breaks when there's a section break at the end of every page, because it could definitely come up in the future. Sometimes I've had to insert section breaks at the end of every page when there's changing information in every footer, for example, if it says "Page X/20" as opposed to just the page number by itself.

After I realized that the section breaks were not in my original translation, I tried eliminating them in the source document within OmegaT to see if that would do anything. Some section breaks cut in the middle of charts and could not be eliminated; however, I eliminated all the other ones. The problematic section of text that I mentioned before had been fixed. However, a small part of the formatting of the source document got messed up. There had been a section that said: "Hearing Screening:", "Vision Screening:", "Medication Regimen:", etc., with information directly to the right of each heading. All the information that had been to the right got pushed down below all the headings. It is only something small that could be changed easily, but I don't know why just taking out a section break would have done that. In any case, if someone could help me understand how to take out line breaks when there are section breaks, I would really appreciate it.

Thanks again everyone!

Samuel Murray

Netherlands
Local time: 10:27
Member (2006)
English to Afrikaans

@Nina

Jul 23, 2020

Nina Halperin wrote:
I wasn't sure if I could delete the paragraph signs while leaving the section breaks...

OmegaT will also segment at section breaks.

In other cases, when I deleted the paragraph sign at the end of the page, that section of text got grayed out for some reason. I ended up undoing that action as a result.

An OCR program will try to make use of all the available features in Word -- even features that normal people almost never use or never encounter. This means that sometimes you need to be an expert-expert at using Microsoft Word in order to fix the OCR'ed document. Either way, knowing how to use Microsoft Word generally is a requirement anyway.

There had been a section that said: "Hearing Screening:", "Vision Screening:", "Medication Regimen:", etc., with information directly to the right of each heading. All the information that had been to the right got pushed down below all the headings.

What you're describing is a normal problem with Microsoft Word and trying to create very specific layouts in Word. Word was not originally designed with layout features in mind, so its capabilities in this regard tend to be prone to failure. This is a case, then, where you would have to fix the Word file at the very end of the translation process -- but that requires being an expert at using Word.

This is a danger when working on documents that were not originally designed with translation in mind. The less suited a document is for translation, the more work the translator will have to do at the end of the translation process.

[Edited at 2020-07-23 07:42 GMT]

esperantisto

Local time: 11:27
Member (2006)
English to Russian

SITE LOCALIZER

File

Jul 23, 2020

Nina Halperin wrote:

Esperantisto, I must admit I am unsure what you are referring to.

I mean the file. Show it. Share it. Upload it to somewhere. There can be various reasons for undesired breaks in segments. Unless you show the file, you remain the only one who can see it.

I have to manually insert the footnotes…

With FineReader 11 and later (not sure about earlier versions), you can mark blocks as page footers / headers.

In any case, if someone could help me understand how to take out line breaks when there are section breaks…

I’m not sure that I understand the situation (once again, show the file!), but here is a StarBasic macro for OpenOffice that might help:

https://pastebin.com/jxhdVKzb

Nina Halperin

Peru
Local time: 03:27
Spanish to English

TOPIC STARTER

Is it necessary to take out all the section breaks of the document before uploading it to OmegaT?

Jul 23, 2020

Samuel, if OmegaT segments at section breaks, is it then necessary to take out all the section breaks of the document before loading it into OmegaT? If the section breaks were actually necessary in the target document (in this case they were not), I suppose I could reinsert them after exporting the target document from OmegaT. Then I would have to edit the headers and footers manually. Am I understanding this correctly? Because I've always recreated the formatting of PDFs on my own in Word, I would say I am at the advanced end of intermediate in terms of using Word, but I'm not an expert. You also said I would have to fix the Word document at the very end of the translation process, but couldn't I just recreate the chart by hand with the correct formatting before loading it into OmegaT?

Esperantisto, oh I see, I was unaware that we can upload documents into a ProZ forum. I'm still not sure how to do that. In any case, I would not be able to upload the file for confidentiality reasons: it is a psycho-educational evaluation report for a student. Even in the one-line example I gave, I changed the name of the student. I opened the link you gave me and clicked on "download," but all that came up was a text box that says:
"Sub DeleteAutoPageBreaks
    oDoc = ThisComponent
    oStyles = oDoc.getStyleFamilies
    oPageStyles = oStyles.getByName("PageStyles")
    oPageStyleNames() = oPageStyles.getElementNames()
    For i = 0 to uBound(oPageStyleNames)
        thisName = oPageStyleNames(i)
        oPageStyles.removeByName(thisName)
    Next i
    oCursor = oDoc.Text.CreateTextCursor()
    oCursor.GoToStart(False)
    Do
        If oCursor.BreakType <> com.sun.star.style.BreakType.PAGE_BEFORE Then
            If NOT IsEmpty(oCursor.PageDescName) Then
                oCursor.PageDescName = ""
                oCursor.BreakType = com.sun.star.style.BreakType.PAGE_BEFORE
            End If
        End If
    Loop Until NOT oCursor.gotoNextParagraph(False)
    Msgbox "Farite"
End Sub"

I wonder if this macro doesn't work on Macs? I don't know anything about macros so maybe that's how it's supposed to work...

Thanks to both of you!

Samuel Murray

Netherlands
Local time: 10:27
Member (2006)
English to Afrikaans
+ ...

@Nina

Jul 23, 2020

Nina Halperin wrote:
You also said I would have to fix the Word document at the very end of the translation process, but couldn't I just recreate the chart by hand with the correct formatting before loading it into OmegaT?

Yes, either fix it before loading it into OmegaT, or fix it after you're done with OmegaT.

Nina Halperin wrote:
If OmegaT segments at section breaks, is it then necessary to take out all the section breaks of the document before loading it into OmegaT?

It is up to you what you do with section breaks. If they split a sentence in the middle, perhaps you can move the one half of the sentence across the section break so that it joins the rest of the sentence.
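To see which paragraphs actually carry a section break before deciding what to move, something like the following Python sketch may help. It assumes the source is a .docx file, where a mid-document section break is stored as a `w:sectPr` element inside the properties of the last paragraph of the section; the function name and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def paragraphs_with_section_break(document_xml):
    """Return the text of every paragraph that carries a section break."""
    root = ET.fromstring(document_xml)
    hits = []
    for para in root.iter(f"{{{W}}}p"):
        # A mid-document section break sits in w:pPr/w:sectPr of the
        # paragraph that ends the section.
        if para.find("w:pPr/w:sectPr", NS) is not None:
            text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
            hits.append(text)
    return hits

# Tiny hand-written stand-in for word/document.xml.
sample = (
    f'<w:document xmlns:w="{W}"><w:body>'
    "<w:p><w:pPr><w:sectPr/></w:pPr><w:r><w:t>before his grandmother and</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>her husband were able to take custody of them.</w:t></w:r></w:p>"
    "<w:sectPr/>"
    "</w:body></w:document>"
)
print(paragraphs_with_section_break(sample))
```

For a real file, the XML can be read with `zipfile.ZipFile(path).read("word/document.xml")` from the standard library. Because the break is attached to a paragraph, this is probably also why a paragraph mark shows up wherever a section break does.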

Nina Halperin

Peru
Local time: 03:27
Spanish to English

TOPIC STARTER

Thanks

Jul 23, 2020

That could be a good option, thanks Samuel!

esperantisto

Local time: 11:27
Member (2006)
English to Russian

SITE LOCALIZER

Macro

Jul 24, 2020

Nina Halperin wrote:

I opened the link you gave me and clicked on "download," but all that came up was a text box that says…
…
I wonder if this macro doesn't work on Macs? I don't know anything about macros so maybe that's how it's supposed to work...

What you see is a piece of code that you need to use in LibreOffice. It should work on Macs, as it has no OS-specific features.

Start reading from here:

https://help.libreoffice.org/6.4/en-US/text/shared/01/06130000.html

Nina Halperin

Peru
Local time: 03:27
Spanish to English

TOPIC STARTER

Running macros in LibreOffice

Jul 25, 2020

Thank you, esperantisto. I downloaded LibreOffice and opened the corresponding document within it. Then I went under Tools, Macros, and Run Macros. However, it seems like I needed to have already uploaded the macro into the system for it to show up under this section. Then I went to Tools, Macros, and Edit Macros. Using this part of the manual you sent me:
"Creates a new macro, creates a new module or deletes the selected macro or selected module.
To create a new macro in your document, select the "Standard" module in the Macro from list, and then click New"
I clicked on Standard and then New at the top of the page. It brought up a blank page. Then I tried copying and pasting the text of the macro into that page and saving it. I have no idea if that's what I'm supposed to do. Afterwards I went back into Run Macros and didn't see the macro I had just saved. Again, I have absolutely no experience with macros so I'm pretty sure what I did was completely wrong. Any suggestions? Thanks!

esperantisto

Local time: 11:27
Member (2006)
English to Russian

SITE LOCALIZER

Almost done

Jul 27, 2020

Nina Halperin wrote:
I'm pretty sure what I did was completely wrong.

Actually, what you did was almost completely correct, good job.

You missed only a final step. When you open the Standard library, you should see Module1, the default container for your macros. Open it to see a dummy piece of code in the right-hand editor window, something like:

Code:

REM  *****  BASIC  *****

Sub Main

End Sub

Paste the code that I provided there (you can delete the above dummy code, it has no use per se). Close the macro editor window or click on Save and switch to your document.

Now, go to Tools → Macros → Run macro → My macros → Standard → Module1 → DeleteAutoPageBreaks, click Run.

[Edited at 2020-07-27 07:31 GMT]

Nina Halperin

Peru
Local time: 03:27
Spanish to English

TOPIC STARTER

@Esperantisto

Jul 27, 2020

Esperantisto, thank you so much for the explanation. I did what you said, and afterwards I got a message saying "Farite" in LibreOffice. I'm not sure what that means; maybe there was an error in the process. Originally when I tried to run the macro, I had also gotten a message that said I needed to download another software in order to run it, but when I tried again it seemed to have worked, except that I got that "Farite" message.

When I saved the document in LibreOffice, it said I had to save it as an ODT in order to preserve the formatting. Then I uploaded the ODT document in OmegaT, and the sentence I mentioned is still divided into two segments. I then tried to open the ODT source document within OmegaT to see if there was still a section break in that part of the document, but it said there was an error in opening the file.
