Analysis question
Thread poster: JohnWhi
JohnWhi
United Kingdom
Local time: 23:08
French to English
Aug 18, 2021

Not a problem, just trying to understand. A few weeks ago, I did a French to English translation of a 12-page document. I ran Analysis before starting and it showed 7% repetitions. Some way through, I noticed that pages 7-12 were effectively a repeat of pages 1-6, with occasional minor differences in wording and format. I might have expected everything to come through from TM after p7, but this only happened occasionally. Looking back at the Wordfast file, I see that the segmentation was rather different on p1-6 and on p7-12, and this may explain it. I had left segmentation at the default settings. Do I need to tweak something?
Work done in Wordfast 6.3.0 (I think) on a MacBook Pro running macOS Catalina and an iMac running Big Sur (latest release versions). I notice that neither can now export a report in anything other than HTML.
Just ran an analysis on the same file using Wordfast 6.4.0, and this brings me down to 2% repetitions!
In the past, I may have used the Analysis when billing, but see that I need to avoid taking much notice of it, unless there is something I need to change to bring it closer to the truth.


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 00:08
Member (2006)
English to Afrikaans
+ ...
@John Aug 19, 2021

JohnWhi wrote:
I see that the segmentation was rather different on p1-6 and on p7-12, and this may explain it.


Indeed, the analysis from a CAT tool is typically based on segmentation alone. I know of no CAT tool that will warn you if it detects that a section of text is "similar to" another section of text unless the individual segments that make up those sections are also highly similar.

From your post I gather that your source file is not a TXLF file but the original file format, is that correct? I can't help with why WFP would segment the second section differently. Or... did you receive a TXLF file (or an XLIFF file) from the client?
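To make the point about segment-level analysis concrete, here is a minimal Python sketch. It is not how Wordfast or any other CAT tool actually computes its analysis; the splitting rule, the function names and the sample sentences are all assumptions made for illustration. It only shows that exact-repetition counting works on segments, so a single stray line break in otherwise identical text lowers the reported figure.

```python
import re
from collections import Counter

def split_segments(text, end_markers=".?!:"):
    """Split at line breaks and after any end marker followed by whitespace."""
    pattern = r"(?<=[" + re.escape(end_markers) + r"])\s+|\n+"
    return [s.strip() for s in re.split(pattern, text) if s and s.strip()]

def repetition_rate(segments):
    """Share of segments that exactly repeat an earlier segment."""
    counts = Counter(segments)
    repeated = sum(n - 1 for n in counts.values())
    return repeated / len(segments) if segments else 0.0

first_half = "The tenant shall pay the rent. The landlord shall maintain the roof."
# Same wording, but a stray line break (hypothetical conversion residue) splits a sentence.
second_half = "The tenant shall pay\nthe rent. The landlord shall maintain the roof."

clean = split_segments(first_half + "\n" + first_half)
broken = split_segments(first_half + "\n" + second_half)

print(f"identical halves:  {repetition_rate(clean):.0%}")
print(f"one stray break:   {repetition_rate(broken):.0%}")
```

With identical halves, every segment of the second half repeats one from the first. With the stray break, the first sentence of the second half becomes two new segments that match nothing.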


 
JohnWhi
United Kingdom
Local time: 23:08
French to English
TOPIC STARTER
Thanks and explanation Aug 19, 2021

Many thanks for that, Samuel. It explains why very little came through from the TM. The agency are totally Trados-based, so I started with a Word file source, converted from the PDF they sent using Adobe PDF export. (See below for the probable answer!) As well as the wording, the layout and format of each of the two sections seem to be the same: line length, punctuation, paragraphs, images. I think I may have the answer, though. The Adobe conversion tends to add a range of inexplicable tags that might cause text to be interpreted as something different, even though it is visibly the same. Over the 12+ years I have used Wordfast, I have noticed occasions when I have a precise memory of translating the same words previously in the same document, it does not show in TM, and I have to locate it in Find & Replace to see what I put. In a few years, importing PDF directly to Wordfast will probably solve this but, for the time being, I will stay with Adobe and Finereader OCR Pro to keep the page layout.


 
Edward Potter
Edward Potter  Identity Verified
Spain
Local time: 00:08
Member (2003)
Spanish to English
+ ...
@John Aug 20, 2021

JohnWhi wrote:

I have noticed occasions when I have a precise memory of translating the same words previously in the same document, it does not show in TM, and I have to locate it in Find & Replace to see what I put.


I find myself having to do this at times. CAT tools are not perfect.


 
JohnWhi
United Kingdom
Local time: 23:08
French to English
TOPIC STARTER
Further investigation Aug 23, 2021

In odd moments over the last few days, I have carried out a few experiments that give a clearer view of the limitations.
1. Used Adobe to convert to plain text and created a test project using this. Analysis showed zero repetitions, so stopped there, though looking in a text editor showed quite a lot of peripheral formatting had been carried through as text.
2. Imported directly from PDF. This only loaded the first page, so stopped there.
3. Used OCR on the PDF to convert to plain text and created a new test project. This showed 16% repetition and there were 109 segments, compared to 259 in the completed project. Using Auto-Propagation/Auto-Suggestion brought back 58% from the memory used for the completed project.
4. Set up a new project in the current 6.4.0 version of Wordfast Pro and added the original Word file. As on the first occasion, this suggested 7% repetition. The number of segments had increased to 265, and 35% came back from the memory through Auto-Propagation/Auto-Suggestion, with no leverage in the other segments.
CONCLUSIONS (RIGHT OR WRONG, FURTHER COMMENTS WELCOME)
1. I gather it is not possible to make segmentation more consistent. To the currently defined End of Segment Markers, .?!: , the only one I would add is ; and I would be wary of checking the other options.
2. Analysis does not say much about repetition, and I will be more wary when asked to give a quote for a job on the basis of a Trados analysis file. The only way is by sight. Seeing about 50% repetition, I reduced my fee by 20%, which I think is fair for a legal document where you have to check the meaning of every word.
3. Translation Memory has become a lot more hit and miss in its operation. My memories of running a file of different format containing a text I had previously translated through Wordfast and having nearly all auto-completed from the TM are now of the past.
(Not a complaint about Wordfast. I find it a convenient tool.)
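As a rough illustration of conclusion 1, the sketch below shows how adding ; to the end-of-segment markers changes the segment count for one line of text. The `segment` function and the sample sentence are hypothetical, and the rule used (break after a marker that is followed by whitespace) is only an assumption about how such markers are applied, not Wordfast's actual logic.

```python
import re

def segment(text, end_markers):
    """Break after any end marker that is followed by whitespace."""
    pattern = r"(?<=[" + re.escape(end_markers) + r"])\s+"
    return [s for s in re.split(pattern, text.strip()) if s]

sample = "Le preneur paie le loyer; le bailleur entretient le toit. Fait à Paris: le 3 mai."

default = segment(sample, ".?!:")          # the markers listed in the post
with_semicolon = segment(sample, ".?!:;")  # the same markers plus ;

print(len(default), default)
print(len(with_semicolon), with_semicolon)
```

The same text yields a different number of segments under each setting, which is one reason counts and TM leverage can differ between projects built from the same source.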


 
Dragomir Kovacevic
Dragomir Kovacevic  Identity Verified
Italy
Local time: 00:08
Italian to Serbian
+ ...
some more reasons why Aug 23, 2021

PDFs converted through CATs produce quite good Word files, but sometimes you will find paragraph marks where they should not be, or they WERE there in the PDF but should not have been. So you'll have to go through the Word file visually, delete these paragraph marks and reconnect the phrases. This work also helps you understand the subsequent translation work better! After improving a Word file, you can pass it through an analysis, thinking very carefully about what to concede in it, or not. Granting too much is granting to the buyer; you can grant if you need to be more competitive, otherwise, why?

Good OCR should not be automatic but controlled, or at least done using a template that settles what to do with soft returns (also in CAT analysis), whether white space after a phrase is added automatically, and so on.

Also, in a CAT tool, adding TAB alongside the usual segmentation markers is more than advisable. If you OCR a PDF, good OCR software will create tabs wherever it finds more than one space. Most probably a good CAT tool will as well.
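The manual clean-up described above (deleting stray paragraph marks and reconnecting phrases) can be approximated in code for plain text. The following Python sketch is an assumption-laden illustration, not a feature of any CAT or OCR product: it treats a line as broken whenever it does not end with one of the markers . ? ! : and joins it to the next line.

```python
END_MARKERS = ".?!:"  # assumed end-of-sentence markers

def rejoin_broken_paragraphs(text):
    """Merge a line into the following one when it does not end with an end marker."""
    merged = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if merged and not merged[-1].endswith(tuple(END_MARKERS)):
            merged[-1] = merged[-1] + " " + line  # previous line was cut mid-sentence
        else:
            merged.append(line)
    return merged

converted = "The tenant shall pay\nthe rent.\nThe landlord shall\nmaintain the roof."
print(rejoin_broken_paragraphs(converted))
```

A rule this blunt would also glue headings and list items (which often lack final punctuation) onto the following text, which is a fair argument for doing the pass visually, or at least checking the result.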


[Edited at 2021-08-23 18:25 GMT]


 
JohnWhi
United Kingdom
Local time: 23:08
French to English
TOPIC STARTER
Thanks and observations Aug 23, 2021

Useful suggestions, Dragomir. I take it that I am mistaken in using Wordfast (limited PDF input and, so far as I can see, ability to output Word from PDF even more limited), Adobe Export (very little control of what it does, apart from choice of output format), and Finereader OCR Pro (no possibility of using templates, whatever these may be in that context, though it copes pretty well with extracting text and placing it correctly depending on preference settings). I take it you suggest adding Tab to the End of Segment Markers and will give it a try. Not sure about removing paragraph returns, as these affect the layout. My main reason for using Wordfast is to have output that resembles the original. Being well into retirement, I am not really looking for "good" CAT and OCR software unless it readily came to hand. I started with a mechanical typewriter, and the cheapest laptop can fulfil that role.


 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 23:08
English to French
+ ...
Mighty interesting! Aug 23, 2021

I will have to thank you for this thread, it is extremely interesting. I will keep this handy.
Just to add some hypotheses I can think of:
_SDL (now RWS) uses an algorithm on top of simple matching, which makes the fuzzy score higher. They also have a slightly different method for segmentation, so when you create an XLIFF in Trados and in Wordfast, you may even get a different segment count.
_Conversions from PDF cause a variety of issues; these could be native to the version of the file prior to it being exported to PDF for the first time. Such issues can even introduce unwanted hidden HTML code that can only be found when you look at the content of a specific tag. A simple tag difference can throw off repetition scoring.
_Source style: some writers will make text fit around an image by adding returns in the middle of a sentence, which gives you separate segments (but you probably spotted those already, I suppose).
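To illustrate how a tag difference alone can defeat an exact match, here is a small Python sketch. The angle-bracket markup is purely hypothetical (CAT tools represent inline tags in their own ways) and `strip_tags` is an illustrative helper, not part of any tool.

```python
import re

TAG = re.compile(r"<[^>]+>")  # hypothetical inline-tag syntax

def strip_tags(segment):
    """Remove inline tags and collapse runs of whitespace."""
    return " ".join(TAG.sub("", segment).split())

# Visibly the same sentence, but the converter wrapped the words in different tags.
page_1 = 'The tenant shall pay <b>the rent</b>.'
page_7 = 'The tenant shall pay <span style="font-size:11pt">the</span> <b>rent</b>.'

print(page_1 == page_7)                          # compared with tags
print(strip_tags(page_1) == strip_tags(page_7))  # compared as plain text
```

The raw strings differ, so a tag-sensitive comparison sees two different segments, while the plain-text comparison sees a repetition.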

Some things to explore that may help:
In Wordfast Pro, you can use the filtering options just above the segments in the editor view. One option would be to display only repetitions (Filter/Segments with repetitions or Filter/Duplicate segments). You can also test that in Quick Tools/extract Uniques and set extract segments to at least "2" times.

Since you are using state-of-the-art methods for converting, my guess is that the issues are in hidden text existing prior to the PDF being produced. The only way to kill this would be to click Clear Formatting in Word and then check whether the repetitions show up. But then you lose all the formatting and have to do it again. Such files are a pain. They tend to make me want to charge an hourly rate when the same client sends them in that state all the time.

Last option: if that's allowed for your documents, pre-translate the whole thing with MT. If the MT is good, you'll have all your tags where you want them and you may have less work editing on repeated segments.


 
JohnWhi
United Kingdom
Local time: 23:08
French to English
TOPIC STARTER
Thanks and update Aug 25, 2021

Many thanks for the helpful suggestions, Philippe. They inspired me to find out more. Of course, everything depends on segmentation and, browsing through the completed project, I was slightly surprised to find that 147 of the 259 segments did not appear to end with one of the End of Segment Markers set in preferences. Removing Word formatting and creating a new test project was instructive. This indicated 0% repetitions but only 1% identified as "No Match" in the TM, compared with the previous 23%. I noticed that, when removing formatting, Word leaves the pilcrow paragraph marks, so my next attempt deleted those. This gave 5% repetitions and 26% "No Match". Finally, I ended up where I should have started, splitting the original pages 1-6 and 7-12 into two documents and comparing them using a free online text comparison site, https://countwordsfree.com/comparetexts. This gave 92.08% in common between the two parts of the original and a 7.92% difference, therefore about 46% repetition for the whole document.
We live and learn, though in due course I would like to find out more about segmentation and the best way of controlling it so that it is more consistent between documents.
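The estimate above can be reproduced with a short Python sketch. It assumes the two halves are of roughly equal length, so that a second half which is 92.08% in common with the first accounts for about 0.9208 × 0.5 ≈ 46% of the whole document. The word-level comparison uses Python's standard `difflib`; how the comparison site computes its own figure is not known, so `common_share` is only a stand-in, and the sample sentences are invented.

```python
from difflib import SequenceMatcher

def common_share(first, second):
    """Share of the second text's words that also occur, in order, in the first."""
    a, b = first.split(), second.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(b) if b else 0.0

def whole_document_repetition(share_in_common, second_part_fraction=0.5):
    """Repetition for the whole document if only the second part repeats the first."""
    return share_in_common * second_part_fraction

pages_1_to_6 = "the tenant shall pay the rent on the first day of each month"
pages_7_to_12 = "the tenant shall pay the rent on the last day of each month"

print(f"{common_share(pages_1_to_6, pages_7_to_12):.2%} in common")
print(f"{whole_document_repetition(0.9208):.2%} of the whole document")
```

Working at word level sidesteps segmentation entirely, which is why this kind of comparison can see repetition that a segment-based analysis misses.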


 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 23:08
English to French
+ ...
SRX - REGEX Aug 25, 2021

JohnWhi wrote:
We live and learn, though in due course I would like to find out more about segmentation and the best way of controlling it so that it is more consistent between documents.


Segmentation is difficult, for everyone, and has been for a long time. In Pro, you can play with the segmentation rules and see the impact on a paragraph you copy and paste in; it's nicely interactive.

If you want to go beyond this, then you need to dive into REGEX. I know in SDL you can set up REGEX rules in some places. The easiest way would probably be to upload an SRX file. Wordfast Pro does not support upload of SRX files for segmentation but Wordfast Anywhere does. So, it's possible to get an SRX file to fine-tune segmentation to one's needs, but I doubt this could perfectly fix issues coming from file conversion (if anyone knows the subject please correct me here, that would be very useful).
Personally, I would lean towards a solution that adds a layer on top of the standard XML filtering when converting a file to XLIFF, but then again, that is not easy either.

If you find anything interesting, please share.
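For readers unfamiliar with regex-based segmentation, the sketch below mimics the general idea behind SRX rules in Python: an ordered list of break and no-break rules, each with a pattern for the text before a candidate position and a pattern for the text after it, where the first matching rule wins. It is not an SRX parser and does not reflect how Wordfast Anywhere or Trados apply their rules; the rule list, the abbreviations and the `segment` function are assumptions for illustration.

```python
import re

# (is_break, before_pattern, after_pattern); the first rule matching at a position wins.
RULES = [
    (False, r"\b(?:M|Mme|art|p)\.", r"\s"),  # exception: no break after these abbreviations
    (True, r"[.?!:]", r"\s"),                # break after an end marker followed by whitespace
]

def segment(text, rules=RULES):
    """Split text at every position where the first matching rule is a break rule."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            before_ok = re.search(r"(?:" + before + r")\Z", text[:pos])
            after_ok = re.match(after, text[pos:])
            if before_ok and after_ok:
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule decides; skip the remaining rules
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

print(segment("Voir art. 12 du bail. M. Dupont paie le loyer."))
```

Exception rules like the first one are what keep abbreviations from being cut, but as noted above, rules of this kind act on the text as delivered and cannot by themselves repair line breaks or tags introduced by file conversion.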


 

