CAT tool with easily adjustable segmentation rules
Thread poster: Csaba Lehel
Csaba Lehel
Hungary
English to Hungarian
Jun 8, 2019

Hi,
I would need your advice. I have been working on a long-term volunteer subtitling project for more than 1.5 years. So far I have used only MS Word, and I am now looking for a CAT tool. Translation is normally done in a DOC or DOCX table, with the left column for the source and the right for the target. There is one special problem: since these will be subtitles, an average sentence is split across a few cells (to make sure a subtitle is not too long, the text has to fit on just one line within a cell). Most CAT software treats every cell of a table as a new sentence. Is there an easy way to set up a CAT tool's segmentation rules so that a sentence spread across 3-5 cells is still treated as one sentence? Of course there are ways to modify the files, but there are many of them, and it would be much easier to teach a CAT tool once and for all that a sentence is still a sentence, even if it is spread across many cells. Of course it would also help if the CAT software were easy to use, free, etc. But the important point is that it has to work.

Thanks for your help in advance!
Csaba


 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Clipboard Workflow Jun 9, 2019

CafeTran Espresso 10 Croissant offers the Clipboard Workflow that allows you to set segmentation to parts of table cells or to multiple, adjacent table cells:

https://cafetran.freshdesk.com/support/solutions/folders/6000058167

Stage 1: Translating

In MS Word you define the translation unit (segment), e.g. by selecting three table cells:

[Screenshot]

After you have pressed CTRL+C (CMD+C), your selection will be presented as one translation unit in CafeTran Espresso 10 Croissant:

[Screenshot]

Once you have translated your translation unit, press NEXT and switch to MS Word. Paste your translation via CTRL+V (CMD+V) into the first of the three matching cells of the second column:

[Screenshot]

Stage 2: Post-segmenting

Run a macro*) that matches the segments of your translation to the corresponding source segments:

[Screenshot]

Questions

Will the source and target always have the same number of rows? Or are situations like these possible:

[Screenshot: 2-3]

And:

[Screenshot: 3-2]

The post-segmentation macro

Here is a quick sketch of the post-segmentation macro in pseudocode:

  • For every cell in column 2 of table 1 in the current document, check for presence of any new line characters.
  • If a new line character is found, select the cell content from the first new line character up to the end of the cell content and cut.
  • Go to the next cell in column 2, paste the clipboard content.
  • Repeat from step 1.
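The pseudocode above can be sketched in Python as follows. This is only an illustration on plain strings, not a Word macro: the function name `post_segment` and the representation of the target column as a list of strings are assumptions, and a real macro would have to work on the Word table cells and on whatever line-break character Word actually stores.

```python
def post_segment(target_cells):
    """Move everything after the first line break of a cell into the next cell.

    target_cells: list of str, one entry per cell of the target column
    (a hypothetical stand-in for column 2 of the Word table).
    """
    cells = list(target_cells)
    # The last cell has no following cell to receive overflow, so stop before it.
    for i in range(len(cells) - 1):
        first_line, line_break, rest = cells[i].partition("\n")
        if not line_break:
            continue  # no line break in this cell, nothing to move
        cells[i] = first_line
        # "Paste" the cut text at the start of the next cell (normally empty).
        if cells[i + 1]:
            cells[i + 1] = rest + "\n" + cells[i + 1]
        else:
            cells[i + 1] = rest
    return cells


if __name__ == "__main__":
    pasted = ["line one\nline two\nline three", "", ""]
    print(post_segment(pasted))
```

Because each pass moves only the overflow of one cell, a translation pasted into the first of three cells trickles down one cell per step, which is the "repeat from step 1" behaviour of the pseudocode.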



[Edited at 2019-06-09 08:32 GMT]


 
Csaba Lehel
Hungary
English to Hungarian
TOPIC STARTER
Re: Clipboard Workflow Jun 9, 2019

Thank you, I will check that! But as I see, it requires switching between the two applications for every sentence. It would be better to have a solution that can handle the whole source text in the file. Or is it possible to transfer this way in one step more than one sentence from Word to CAT, and CAT to see them properly?

 
Csaba Lehel
Hungary
English to Hungarian
TOPIC STARTER
Structure based segmentation rules Jun 9, 2019

I read in OmegaT's manual that there are structure-based and sentence-based segmentation rules. I guess "end of cell = end of sentence" is some kind of structure-based segmentation rule. If it could be changed, that would solve the whole problem. But this is as far as I got. In OmegaT I can't even understand the rules, let alone modify them or add a new one. But there may be other solutions as well, of course.

 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Join or Split Jun 9, 2019

Csaba Lehel wrote:

Thank you, I will check that! But as I see, it requires switching between the two applications for every sentence. It would be better to have a solution that can handle the whole source text in the file. Or is it possible to transfer this way in one step more than one sentence from Word to CAT, and CAT to see them properly?


If the segmentation in the first column doesn’t need to be kept, you can just use Join and Split in nearly every CAT tool.


 
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
@Csaba Jun 9, 2019

Csaba Lehel wrote:
Translation is normally done in a DOC table, left column for source, right for target. There is one special problem: since this will be subtitle, an average sentence is split into a few cells.


OmegaT does have a function whereby it ignores line breaks in TXT files. So what you could do, is this:

1. Copy/paste the source column into a new Word file.
2. Convert the table to text.
3. Add <#> at the end of every line (i.e. find ^p, replace with <#>^p).
4. Save as plain text (TXT) with Unicode or UTF-8 encoding.
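If the source lines are already available outside Word, steps 3 and 4 could also be done with a small script. The following is a minimal sketch under that assumption; the helper names `mark_lines` and `write_marked` are hypothetical, and the input is assumed to be one string per table cell.

```python
from pathlib import Path

MARKER = "<#>"  # the end-of-line marker suggested in step 3


def mark_lines(lines):
    """Append the marker to every line and join the lines into one text."""
    return "".join(line + MARKER + "\n" for line in lines)


def write_marked(lines, path):
    """Save the marked text as a UTF-8 plain text file (step 4)."""
    Path(path).write_text(mark_lines(lines), encoding="utf-8")


if __name__ == "__main__":
    source_cells = ["This sentence is", "spread across", "three cells."]
    print(mark_lines(source_cells), end="")
```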

In OmegaT:

1. Create a new project and add the TXT file to it.
2. In OmegaT, go to Options > File Filters. Select "Text" and click the Options button. Set the "Segment source text into paragraphs on" setting to "Never".
3. In OmegaT, go to Options > Preferences > Tag Processing. In the field called "Regular expression for custom tags", type <#> (or, if there is already something in that field, add |<#> to it).
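To see what adding |<#> to an existing expression does, here is a small illustration in Python. OmegaT itself uses Java regular expressions, so this is only an approximation for simple patterns; the helper `add_custom_tag` and the example pattern `\{\d+\}` are made up for the demonstration.

```python
import re


def add_custom_tag(existing_pattern, new_tag_pattern="<#>"):
    """Return the expression to type into the custom-tag field."""
    if not existing_pattern:
        return new_tag_pattern
    # "|" means "or": text matching either alternative is treated as a tag.
    return existing_pattern + "|" + new_tag_pattern


if __name__ == "__main__":
    # Hypothetical existing custom-tag expression matching {1}, {2}, ...
    combined = add_custom_tag(r"\{\d+\}")
    print(combined)
    print(re.findall(combined, "Hello {1} world<#>"))
```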

Then do the translation. To create a line break, press Shift+Enter. Make sure every segment that has <#> in the source also has it in the target (an easy way to insert it is using Ctrl+Space a few times). You can check if you've forgotten any, by using Tools > Check issues at any time, and when you create the final file. If you keep forgetting that "Enter" is for moving to a new segment, you can disable it in Options > Preferences > General (use TAB to advance), but you'll still have to use Shift+Enter to insert a new line.

The reason for the <#> is to help you to check the final file to make sure it has the same number of lines as the original one, and that the cells are likely to match up when you paste it into MS Word in the end. I mean, if you're happy that you don't need to use the <#>, then you don't have to, obviously.
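The line-count check described above can also be done outside OmegaT. This is a minimal sketch, assuming the source and translated TXT files have been read into strings; the function names are hypothetical.

```python
MARKER = "<#>"


def count_markers(text):
    """Count how many end-of-line markers a text contains."""
    return text.count(MARKER)


def same_line_count(source_text, target_text):
    """True if source and target contain the same number of markers."""
    return count_markers(source_text) == count_markers(target_text)


def split_into_cells(text):
    """Split a marked text back into one string per table cell."""
    parts = text.split(MARKER)
    # The text after the final marker is normally just a trailing line break.
    if parts and not parts[-1].strip():
        parts.pop()
    return [part.strip() for part in parts]


if __name__ == "__main__":
    source = "This sentence is<#>\nspread across<#>\nthree cells.<#>\n"
    target = "Ez a mondat<#>\nhárom cellára<#>\nvan elosztva.<#>\n"
    print(same_line_count(source, target))
    print(split_into_cells(target))
```

If the counts match, the list returned by `split_into_cells` has one entry per source cell, so it can be pasted back into the target column of the Word table.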

One downside to this method is that it's very difficult in OmegaT to merge or split a segment, so if you have more than one sentence in a single cell, OmegaT will show them as separate segments (which may not be a problem, but it's worth keeping in mind).



[Edited at 2019-06-09 11:59 GMT]


 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Another approach Jun 9, 2019

Csaba Lehel wrote:

Thank you, I will check that! But as I see, it requires switching between the two applications for every sentence. It would be better to have a solution that can handle the whole source text in the file. Or is it possible to transfer this way in one step more than one sentence from Word to CAT, and CAT to see them properly?


If the segmentation in the first column doesn’t need to be kept, you can just use Join and Split in nearly every CAT tool.

If the segmentation of the source needs to be kept, this can be done too:

https://youtu.be/Vq9OrZO7odI

(Just convert the TM to a 2-column MS Word document afterwards.)


 
Csaba Lehel
Hungary
English to Hungarian
TOPIC STARTER
Thanks to both of you! Jun 9, 2019

To Hans: Thank you for making the video! The segmentation of the source has to be kept, and the target also has to have a matching number of segments. I see it is possible to do this in CafeTran. At the moment I have just OmegaT, which has nothing similar to join or split. I checked Wordfast Anywhere, which has Expand and Shrink, which look similar.

To Samuel: thanks, I tried reproducing what you wrote, and I think it worked. I like this approach better, because it works with the whole file, and works even in OmegaT without join and split. If I can't match the number of target segments to the source perfectly in OmegaT, no problem; it is quite easy to correct that in Word as well.

I still think changing the segmentation rules would be the simplest way, but the only thing I discovered is that this hidden character at the end of cells is called either 'end of cell mark', or 'end of row mark' (or marker). How to tell a CAT tool not to end the sentence there, I have no idea. I found that Okapi's Ratel can create this kind of SRX file, but I am completely lost at this point.

Thank you, really, to both of you!!!


 
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
@Csaba Jun 9, 2019

Csaba Lehel wrote:
I still think changing the segmentation rules would be the simplest way, but the only thing I discovered is that this hidden character at the end of cells is called either 'end of cell mark', or 'end of row mark' (or marker).


I think block-level (i.e. paragraph or cell) segmentation is pretty much hard-coded in most CAT tools. You can only adjust the segmentation rules within those block-level elements, and not create a segmentation rule that causes the segment to span across more than one block-level element. I know of no CAT tool that can do this with Word or Excel files.
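The two-level behaviour described here can be illustrated with a toy segmenter in Python. It is not how any particular CAT tool is implemented; the function `segment` and the very simple sentence-break rule are assumptions chosen to show why a rule applied inside a block can never join text from two cells.

```python
import re

# Toy sentence rule: break after ., ! or ? when followed by whitespace.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def segment(blocks):
    """Split each block (paragraph or table cell) into sentence segments.

    The rule runs on one block at a time, so a block boundary always ends
    a segment, whatever the rule says.
    """
    segments = []
    for block in blocks:
        for sentence in SENTENCE_BREAK.split(block):
            if sentence:
                segments.append(sentence)
    return segments


if __name__ == "__main__":
    cells = ["This sentence is", "spread across", "three cells. Short one."]
    for seg in segment(cells):
        print(seg)
```

The one sentence spread over three cells comes out as three segments, while the last cell is split further into two, because the rule can only add breaks within a cell and never remove the break between cells.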

Expand/merge and shrink/split in most CAT tools won't allow you to merge across block-level boundaries either. The purpose of the merge/split feature in a CAT tool is mostly to help fix it when the CAT tool didn't guess correctly how to segment a piece of text.

At the moment I have just OmegaT, that has nothing similar to join or split.


Yes, no, OmegaT doesn't have a built-in merge and split feature yet. To merge or split in OmegaT, you have to edit the segmentation rules, which are quite complicated. You can merge and split using a script in OmegaT, though (just ask in the OmegaT forum where to get it).
But anyway, OmegaT doesn't allow you to set segmentation rules that will expand segments beyond the paragraph/cell boundary, so a merge/split feature won't help you anyway.

The best current OmegaT solution is to edit the source text and then press F5 (reload).

[Edited at 2019-06-09 19:16 GMT]


 
Jean Dimitriadis
English to French
And now for something completely different Jun 9, 2019

It may sound counter-intuitive, but if you are translating subtitles, why not use subtitling software instead?

The MS Word tables workflow you describe does not sound very efficient, although I don't know the specifics (like at which point the subtitles are synchronized, are they synced by somebody else? What happens to the Word tables? etc.)

If you use a subtitling program, you can have stuff such as CPL (characters per line, with warnings in case a line exceeds the specified limit), CPS (characters per second), WPM (words per minute), subtitle duration, as well as the ability to review the audio and (p)review the video if needed, merge or split subtitles, tweak the timing, etc., all of which are useful when subtitling.
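To make the CPL and CPS figures concrete, here is a rough sketch of how such checks might be computed. The limits of 42 characters per line and 17 characters per second are placeholder values, not taken from this thread or from any particular program, and counting conventions (for example whether spaces count) differ between tools.

```python
def chars_per_line(subtitle):
    """Length of the longest line in a subtitle (CPL)."""
    return max(len(line) for line in subtitle.split("\n"))


def chars_per_second(subtitle, duration_seconds):
    """Characters shown per second of display time (CPS).

    Line breaks are not counted; spaces are. duration_seconds must be > 0.
    """
    visible = len(subtitle.replace("\n", ""))
    return visible / duration_seconds


def check_subtitle(subtitle, duration_seconds, max_cpl=42, max_cps=17.0):
    """Return a list of warnings; the default limits are hypothetical."""
    warnings = []
    if chars_per_line(subtitle) > max_cpl:
        warnings.append("line too long")
    if chars_per_second(subtitle, duration_seconds) > max_cps:
        warnings.append("reading speed too high")
    return warnings


if __name__ == "__main__":
    print(check_subtitle("A short line\nand another one", 2.5))
```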

And of course, line breaks are no issue at all.

Translating subtitles is still subtitling in my book, not merely translation. It is an activity that often requires a different approach and specific strategies/techniques.

I don't know your actual workflow, but I suggest you take a step back and review the whole process.

Here's how I would tackle such a task:

- Transcribe the original audio and (then) time-code it (roughly at least, or more precisely, if the original language subtitles will be used in production and for translating into multiple other languages).
- Translate the subtitles using a subtitling software.


If you must use a CAT tool, some CAT tools handle subtitling formats, but unless they offer information such as CPL, CPS, subtitling duration, video/subtitles preview, etc. I'd say subtitling in a CAT tool is an exercise in futility. This is just my personal opinion.

That said, SDL Trados 2019 now offers a plugin which seems very promising, as it offers most of the capabilities one would require, plus some added QA features.

But for voluntary work, unless you or your team already own SDL Trados 2019, that might represent quite an investment.

Especially if you are working with a team, why not use a specialized gratis application for subtitling?

The cross-platform free/libre software Aegisub comes to mind, but its development has ceased, so I don't know if it will be a solution in the long run. It even has a "Translation" module, but you can just translate with its main interface as well.

Anyway, this is just a (different) POV.

Jean

[Edited at 2019-06-09 20:50 GMT]
 
Csaba Lehel
Hungary
English to Hungarian
TOPIC STARTER
This is different subtitling Jun 9, 2019

Hi Jean,
I am just a member of a huge team subtitling the same TV shows in more than 20 languages. The people who have an overview of the whole project decided on this format; I can do nothing but follow it. I do not even know how they come up with the timecodes or how they handle subs in 20+ languages, nothing, really. There is quite a lot of repetition, with similar phrases over and over, which is why I am trying a CAT tool.


 
Heinrich Pesch
Finland
Member (2003)
Finnish to German
Volunteer subtitling project Jun 10, 2019

Those volunteers are the people that have brought rates for subtitling professionals down. Maybe there is software from volunteering software designers that would solve your problem.
