Segmentation rule to deal with quote marks
Thread poster: Thijs Vissia
Thijs Vissia
Netherlands
Jan 8, 2020

I'm translating from English to Dutch, and I notice that when a sentence ends on a quote mark (either curly quotes or straight), with the period sitting between quotes right before the end quote (i.e. "This is a sentence." or “This is a sentence.”), the segmentation script considers it a single segment, even though the quote ends. (Instead of a situation where multiple sentences are quoted, where I might want the quote to remain in a single segment.)

I was wondering if anyone could help me to figure out the break or exception rule for this, so that a segment separates after the quote, instead of after a full stop.

I've tried it with \.\" (in the Before field) and \s (in the After field), as well as the same with curly quotes, but both don't seem to have the desired effect.

If there's anywhere that offers more guidance about how to construct the rules, I'd also be interested in that, the manual is a bit sparse I thought.

I'm also not sure how to distinguish between a break rule and an exception (no break) rule, there's a checkmark in the Segmentation dialog but I'm not sure what the checkmark means - if it doesn't mean "select this rule for further operations". How does the dialog allow me to set a break or a no break rule?

Thanks for any help!


 
Samuel Murray
Netherlands
@Thijs Jan 8, 2020

I suggest you re-ask your question here:
https://sourceforge.net/projects/omegat/lists/omegat-users
...since segmentation rules are perhaps more geeky than most other issues.


 
esperantisto
Show it! Jan 8, 2020

Thijs Vissia wrote:


I've tried it with \.\" (in the Before field) and \s (in the After field), as well as the same with curly quotes, but both don't seem to have the desired effect.


For starters, ", “ and ” are three different symbols, thus \.\" won’t work for ”.
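
To see the difference concretely, here is a small Java sketch (OmegaT's segmentation rules are Java regular expressions). The class name QuoteChars, the helper found and the sample sentences are made up for illustration. It checks whether "period followed by a quote" is found, first with a pattern that only knows the straight quote, then with a character class that also lists the curly closing quote.

```java
import java.util.regex.Pattern;

public class QuoteChars {
    // Returns true if the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String straight = "He said \"This is a sentence.\" Then he left.";
        String curly = "He said \u201CThis is a sentence.\u201D Then he left.";

        // Regex \.\" : a period followed by the straight quote U+0022 only.
        System.out.println(found("\\.\\\"", straight)); // true
        System.out.println(found("\\.\\\"", curly));    // false

        // Regex \.[\"”] : a period followed by a straight OR a curly closing quote.
        System.out.println(found("\\.[\\\"\u201D]", straight)); // true
        System.out.println(found("\\.[\\\"\u201D]", curly));    // true
    }
}
```

Note the doubled backslashes: they are only needed inside Java string literals. In the Before and After fields of the Segmentation dialog the pattern is typed with single backslashes.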

In order to understand why your attempts failed, share:


  • a short sample file/sample project;

  • your segmentation rules (i. e. your segmentation.conf).



Thijs Vissia wrote:
there's a checkmark in the Segmentation dialog but I'm not sure what the checkmark means - if it doesn't mean "select this rule for further operations". How does the dialog allow me to set a break or a no break rule?


Every rule set in the dialog is applied. The checkmark when ticked means that the rule makes a break. If not ticked, the rule is a joiner.

[Edited at 2020-01-08 14:37 GMT]


 
tcordonniery
France
Break rules and exceptions Jan 8, 2020

Thijs Vissia wrote:
I've tried it with \.\" (in the Before field) and \s (in the After field), as well as the same with curly quotes, but both don't seem to have the desired effect.


That is correct, except that, as others said, \" does not cover the character ”
Try \.[\"”] instead.
Also make sure that this rule appears before the rule with before = \. and after = \s. Rule order is important because the segmenter applies the rules in the order they appear, and once a rule has matched at a location in your text, no later rule can affect that location.

Another option I would test: before = \. and after = [\"”], but then it is an exception, not a break rule. See below for why.

Thijs Vissia wrote:
I'm also not sure how to distinguish between a break rule and an exception (no break) rule,


An exception means that we do not want to cut, even if the rules which follow the given one say that we should.
Example:
1. Normally after a dot, we want to cut. This is a break rule, which is usually at the end of the rule set.
2. However, after "Mr." you should not cut because this is an abbreviation
(example: Mr. Smith said that... ==> if we only apply rule 1, the segmenter will cut after the dot; the exception prevents that, but only if the exception is declared before the rule)
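
The "first matching rule wins" behaviour can be sketched in a few lines of Java. This is not OmegaT's actual code, only a toy model under stated assumptions: the class RuleOrderSketch, the Rule type and the segment method are hypothetical names, and "before" is emulated by requiring the pattern to end exactly at the candidate position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class RuleOrderSketch {
    /** Hypothetical rule type: isBreak=false marks an exception (no break). */
    static final class Rule {
        final boolean isBreak;
        final Pattern before; // must match text ending at the candidate position
        final Pattern after;  // must match text starting at the candidate position

        Rule(boolean isBreak, String before, String after) {
            this.isBreak = isBreak;
            this.before = Pattern.compile("(?:" + before + ")\\z");
            this.after = Pattern.compile(after);
        }

        boolean matchesAt(String text, int pos) {
            return before.matcher(text.substring(0, pos)).find()
                && after.matcher(text.substring(pos)).lookingAt();
        }
    }

    /** Splits text; at each position the first matching rule decides. */
    static List<String> segment(String text, List<Rule> rules) {
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (int pos = 1; pos < text.length(); pos++) {
            for (Rule rule : rules) {
                if (rule.matchesAt(text, pos)) {
                    if (rule.isBreak) {
                        segments.add(text.substring(start, pos).trim());
                        start = pos;
                    }
                    break; // later rules cannot affect this position
                }
            }
        }
        segments.add(text.substring(start).trim());
        return segments;
    }

    public static void main(String[] args) {
        String text = "Mr. Smith said \"Stop.\" Then he left.";
        Rule exception = new Rule(false, "Mr\\.", "\\s");           // no break after "Mr."
        Rule quoteBreak = new Rule(true, "\\.[\\\"\u201D]", "\\s"); // break after ." or .”
        Rule general = new Rule(true, "[\\.\\?\\!]+", "\\s");       // general break rule

        // Exception declared first: [Mr. Smith said "Stop.", Then he left.]
        System.out.println(segment(text, Arrays.asList(exception, quoteBreak, general)));
        // Exception declared last: [Mr., Smith said "Stop.", Then he left.]
        System.out.println(segment(text, Arrays.asList(quoteBreak, general, exception)));
    }
}
```

With the exception listed last, the general break rule claims the position after "Mr." first, so the exception never gets a chance to prevent the cut.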

esperantisto wrote:
Every rule set in the dialog is applied. The checkmark when ticked means that the rule makes a break. If not ticked, the rule is a joiner.


Really? The opposite of a break rule is an exception (a location where you do not want to break). I don't see what a joiner would be, since segmentation rules are only used to cut segments (when segments are joined, it seems a space is always used between them).


 
Thijs Vissia
Netherlands
TOPIC STARTER
Thank you Jan 8, 2020

Thanks for your responses.

I later gathered that the checkmark in front of each rule would mean an exception, or a "no-break rule". But I'll look into that further and look at the manual again, I'm not as focussed right now.

tcordonniery,
Thanks for your rewritten rule, that does appear to do the job. In my current (in fact, the default) English language segmentation rules, there doesn't seem to be a rule that this one should appear before, as you write ("the rule with before = \. and after = \s" isn't to be found, though this seems a little odd.)

In any case, the segment breaks now appear where I want them, hoping it doesn't cause any adverse results elsewhere.

Many thanks for the help, all!

Oh, another related question I just thought of:

I suppose that these segmentation rules are set for OmegaT as a whole, and not per project. How do I prevent such changes from messing up other translation projects, since these would presumably be re-segmented upon opening them? Does that mean I lose the target segments (when I open the project, or when I save it)? Or not?


 
esperantisto
Project-specific rules Jan 9, 2020

Thijs Vissia wrote:

("the rule with before = \. and after = \s" isn't to be found, though this seems a little odd.)


It is a rule for very many languages. Thus, it is not language-specific and can be found under the Default rule set.

Thijs Vissia wrote:

How do I prevent such changes from messing up other translation projects, since these would presumably be re-segmented upon opening them?


Two options basically:

1. (Simple) Create a project-specific rule set: Project properties → Segmentation → Make segmentation rules project-specific.
2. (Not so simple) Create a separate user profile and start OmegaT to use it when you want to work with specific projects: see "7. Starting OmegaT from the command line".


 
Thijs Vissia
Netherlands
TOPIC STARTER
Cheers Jan 9, 2020

esperantisto wrote:

Thijs Vissia wrote:

("the rule with before = \. and after = \s" isn't to be found, though this seems a little odd.)


It is a rule for very many languages. Thus, it is not language-specific and can be found under the Default rule set.

Thijs Vissia wrote:

How do I prevent such changes from messing up other translation projects, since these would presumably be re-segmented upon opening them?


Two options basically:

1. (Simple) Create a project-specific rule set: Project properties → Segmentation → Make segmentation rules project-specific.
2. (Not so simple) Create a separate user profile and start OmegaT to use it when you want to work with specific projects: see "7. Starting OmegaT from the command line".


Great, thank you very much.

(The first part of your answer I figured out after posting, it's there as before:"\.\?\!" and after:"\s" in the Default rules, if I'm not mistaken.)


 
Andrey Raugas
Georgia
Using Java and Unicode in segmentation rules Feb 1, 2020

Hi. I don't know if this question is still relevant, nevertheless here are my findings on segmentation rules. The rules use Java regular expressions. Here is a tutorial:
https://docs.oracle.com/javase/tutorial/essential/regex/index.html

A great feature of Java regular expressions is their support for Unicode character categories. Here is a list of them:
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category

Using that information I constructed the following segmentation rule in order to deal with quotation marks, brackets and tags at the beginning and at the end of English sentences.

Add this as a checked line (break) at the bottom of the "Default" rules section:
before: [\.\?\!]+(?:\p{Pf}|\p{Pe}|<[\w/]+>)*
after: \s+(?:\p{Ps}|\p{Pi}|<[\w/]+>)*\p{Lu}

Some explanations:
before:

  • [\.\?\!]+ means that any of these punctuation marks (or a combination of them) occurs one or more times
  • \p{Pf} is a closing (final) quotation mark
  • \p{Pe} is a closing bracket
  • <[\w/]+> is a tag, i.e. some combination of slashes (/), letters, digits and underscores (\w) in angle brackets
  • (?:\p{Pf}|\p{Pe}|<[\w/]+>)* means that any combination of closing quotation marks, closing brackets and tags occurs zero or more times (a non-capturing group is used here because parentheses placed inside square brackets are literal characters, not grouping)


and so on.
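
A rule like this can be tried out in plain Java before pasting it into the dialog. The following is a minimal sketch, not OmegaT code: the class UnicodeCategoryCheck and its helpers are hypothetical, the optional closers and openers are written as a non-capturing alternation group (in Java, parentheses inside square brackets are literal characters rather than grouping), and \z is appended to the "before" pattern as an assumption about how to emulate "ends at the candidate break position".

```java
import java.util.regex.Pattern;

public class UnicodeCategoryCheck {
    // True if the single character c belongs to the given Unicode general category, e.g. "Pf".
    static boolean inCategory(String c, String category) {
        return Pattern.matches("\\p{" + category + "}", c);
    }

    // "before" must end at the candidate position, hence the trailing \z (assumption).
    static final Pattern BEFORE =
        Pattern.compile("[\\.\\?\\!]+(?:\\p{Pf}|\\p{Pe}|<[\\w/]+>)*\\z");
    // "after" must start at the candidate position.
    static final Pattern AFTER =
        Pattern.compile("\\s+(?:\\p{Ps}|\\p{Pi}|<[\\w/]+>)*\\p{Lu}");

    // True if the rule pair would place a break at index pos of text.
    static boolean breaksAt(String text, int pos) {
        return BEFORE.matcher(text.substring(0, pos)).find()
            && AFTER.matcher(text.substring(pos)).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(inCategory("\u201D", "Pf")); // curly closing quote: true
        System.out.println(inCategory("\"", "Pf"));     // straight quote: false (it is in Po)

        String curly = "He asked \u201CReady?\u201D She nodded.";
        String straight = "He asked \"Ready?\" She nodded.";
        System.out.println(breaksAt(curly, curly.indexOf('\u201D') + 1));      // true
        System.out.println(breaksAt(straight, straight.lastIndexOf('"') + 1)); // false
    }
}
```

The last line shows a limitation worth knowing: the straight quote " belongs to the category Po (other punctuation), not Pf, so a rule built only on \p{Pf} does not break after ." with straight quotes. Adding \" to the alternation would be one way to cover that case.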

[Edited at 2020-02-01 09:22 GMT]


 
Thijs Vissia
Netherlands
TOPIC STARTER
many thanks Feb 5, 2020

Andrey Raugas wrote:

Hi. I don't know if this question is still relevant, nevertheless here are my findings on segmentation rules. The rules use Java regular expressions. Here is a tutorial:
https://docs.oracle.com/javase/tutorial/essential/regex/index.html

A great feature of Java regular expressions is their support for Unicode character categories. Here is a list of them:
https://en.wikipedia.org/wiki/Unicode_character_property#General_Category


hi Andrey,

Pardon my late reply, but thank you for pointing this out, very good to know. And thanks also for taking the trouble of constructing an expression to deal with this, I will give it a try.

best,
Thijs


 

