Segmentation rule to deal with quote marks
Thread poster: Thijs Vissia
I'm translating from English to Dutch, and I notice that when a sentence ends with a quote mark (either curly or straight), with the period sitting inside the quotes right before the end quote (e.g. "This is a sentence." or “This is a sentence.”), the segmentation script considers it a single segment, even though the quote ends. (This is unlike a situation where multiple sentences are quoted, where I might want the quote to remain in a single segment.)

I was wondering if anyone could help me figure out the break or exception rule for this, so that a segment separates after the quote, instead of after a full stop. I've tried it with \.\" (in the Before field) and \s (in the After field), as well as the same with curly quotes, but neither seems to have the desired effect.

If there's anywhere that offers more guidance about how to construct the rules, I'd also be interested in that; I thought the manual was a bit sparse.

I'm also not sure how to distinguish between a break rule and an exception (no break) rule. There's a checkmark in the Segmentation dialog, but I'm not sure what the checkmark means, if it doesn't mean "select this rule for further operations". How does the dialog allow me to set a break or a no break rule?

Thanks for any help!

Samuel Murray
esperantisto
Thijs Vissia wrote: I've tried it with \.\" (in the Before field) and \s (in the After field), as well as the same with curly quotes, but both don't seem to have the desired effect.

For starters, ", “ and ” are three different symbols, thus \.\" won’t work for ”. In order to understand why your attempts failed, share:
- a short sample file/sample project;
- your segmentation rules (i.e. your segmentation.conf).

Thijs Vissia wrote: there's a checkmark in the Segmentation dialog but I'm not sure what the checkmark means - if it doesn't mean "select this rule for further operations". How does the dialog allow me to set a break or a no break rule?

Every rule set in the dialog is applied. The checkmark when ticked means that the rule makes a break. If not ticked, the rule is a joiner.
[Edited at 2020-01-08 14:37 GMT]

Break rules and exceptions | Jan 8, 2020
Thijs Vissia wrote: I've tried it with \.\" (in the Before field) and \s (in the After field), as well as the same with curly quotes, but both don't seem to have the desired effect.

That is correct, except that, as others said, \" does not cover the character ”. Try \.[\"”] instead. Also ensure that this rule appears before the rule with before = \. and after = \s. Rule order is important because the segmenter applies the rules in the order they appear, and once a rule affects a location in your phrase, no other rule can affect that location anymore.

Another option I would test: before = \. and after = [\"”], but then it is an exception, not a break rule. The explanation of exceptions further down in this post shows why.

Thijs Vissia wrote: I'm also not sure how to distinguish between a break rule and an exception (no break) rule,

An exception means that we do not want to cut, even if the rules that follow it say that we should. Example:
1. Normally after a dot, we want to cut. This is a break rule, which is usually at the end of the rule set.
2. However, after "Mr." you should not cut, because this is an abbreviation (example: Mr. Smith said that... ==> if we only apply rule 1, the segmenter will cut after the dot; the exception prevents that, but only if the exception is declared before the rule).

esperantisto wrote: Every rule set in the dialog is applied. The checkmark when ticked means that the rule makes a break. If not ticked, the rule is a joiner.

Really? The opposite of a break rule is an exception (a location where you do not want to break). I don't see what a joiner is, since segmentation rules are only used to cut segments (when you want to join them, the rule seems to be always using spaces between joined segments).
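The ordering behaviour described in this post (rules tried top to bottom, the first rule to claim a position wins, exceptions listed before the general break rule) can be illustrated with a small Java sketch. This is not OmegaT's actual implementation: the class `OrderedRulesSketch`, the `Rule` type and the `segment` method are hypothetical names made up for illustration, and the three demo rules are simply the patterns mentioned in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from the OmegaT source code.
class OrderedRulesSketch {

    /** One rule: a break rule or an exception (no-break) rule. */
    static final class Rule {
        final boolean isBreak;
        final Pattern before;
        final Pattern after;

        Rule(boolean isBreak, String before, String after) {
            this.isBreak = isBreak;
            this.before = Pattern.compile(before);
            this.after = Pattern.compile(after);
        }
    }

    /** Rules are tried in list order; the first rule to claim a position wins. */
    static List<String> segment(String text, List<Rule> rules) {
        // position -> true (break here) or false (exception: do not break here)
        Map<Integer, Boolean> decided = new TreeMap<>();
        for (Rule rule : rules) {
            Matcher b = rule.before.matcher(text);
            while (b.find()) {
                int pos = b.end();
                // "after" must match starting exactly at the candidate position
                Matcher a = rule.after.matcher(text).region(pos, text.length());
                if (a.lookingAt()) {
                    decided.putIfAbsent(pos, rule.isBreak);
                }
            }
        }
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (Map.Entry<Integer, Boolean> e : decided.entrySet()) {
            if (e.getValue()) {
                segments.add(text.substring(start, e.getKey()).trim());
                start = e.getKey();
            }
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            segments.add(tail);
        }
        return segments;
    }

    /** Demo rule set built from the patterns discussed in the thread. */
    static List<Rule> demoRules() {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule(false, "Mr\\.", "\\s"));        // exception, listed first
        rules.add(new Rule(true, "\\.[\"\u201D]", "\\s")); // break after a dot plus closing quote
        rules.add(new Rule(true, "\\.", "\\s"));           // general break after a dot
        return rules;
    }

    public static void main(String[] args) {
        String text = "Mr. Smith said \u201CStop.\u201D Then he left. Bye.";
        for (String s : segment(text, demoRules())) {
            System.out.println("[" + s + "]");
        }
    }
}
```

With the exception listed first, "Mr." stays attached to "Smith"; if the exception is moved below the general dot rule, the dot rule claims that position first and "Mr." is cut off as its own segment.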
Thanks for your responses. I later gathered that the checkmark in front of each rule would mean an exception, or a "no-break rule". But I'll look into that further and look at the manual again; I'm not as focussed right now.

tcordonniery,

Thanks for your rewritten rule, that does appear to do the job. In my current (in fact, the default) English language segmentation rules, there doesn't seem to be a rule that this one should appear before, as you write ("the rule with before = \. and after = \s" isn't to be found, though this seems a little odd.) In any case, the segment breaks now appear where I want them, and I hope this doesn't cause any adverse results elsewhere.

Many thanks for the help, all!

Oh, another related question I just thought of: I suppose that these segmentation rules are set for OmegaT as a whole, and not per project. How do I prevent such changes from messing up other translation projects, since these would presumably be re-segmented upon opening them? Does that mean I lose the target segments (when I open the project, or when I save it)? Or not?

esperantisto
Project-specific rules | Jan 9, 2020
Thijs Vissia wrote: ("the rule with before = \. and after = \s" isn't to be found, though this seems a little odd.)

It is a rule for very many languages. Thus, it is not language-specific and can be found under the Default rule set.

Thijs Vissia wrote: How do I prevent such changes from messing up other translation projects, since these would presumably be re-segmented upon opening them?

Two options basically:
1. (Simple) Create a project-specific rule set: Project properties → Segmentation → Make segmentation rules project-specific.
2. (Not so simple) Create a separate user profile and start OmegaT to use it when you want to work with specific projects: 7. Starting OmegaT from the command line.
Great, thank you very much. (The first part of your answer I figured out after posting, it's there as before:"\.\?\!" and after:"\s" in the Default rules, if I'm not mistaken.)

Using Java and Unicode in segmentation rules | Feb 1, 2020
Hi. I don't know if this question is still relevant; nevertheless, here are my findings on segmentation rules.

The rules use Java regular expressions. Here is a tutorial: https://docs.oracle.com/javase/tutorial/essential/regex/index.html

A great feature of Java regular expressions is support for Unicode character categories. Here is a list of them: https://en.wikipedia.org/wiki/Unicode_character_property#General_Category

Using that information, I constructed the following segmentation rule to deal with quotation marks, brackets and tags at the beginning and at the end of English sentences. Add this as a checked line (break) at the bottom of the "Default" rules section:

before: [\.\?\!]+[\p{Pf}\p{Pe}(<[\w/]+>)]*
after: \s+[\p{Ps}\p{Pi}(<[\w/]+>)]*\p{Lu}

Some explanations for the before part:
- [\.\?\!]+ means that any of these punctuation marks (or their combination) occurs one or more times
- \p{Pf} is a closing quotation mark
- \p{Pe} is a closing bracket
- <[\w/]+> is a tag, i.e. some combination of slash (/), letters and digits (\w) in angle brackets
- [\p{Pf}\p{Pe}(<[\w/]+>)]* means that any combination of closing quotation mark, closing bracket and a tag occurs zero or more times
and so on.
[Edited at 2020-02-01 09:22 GMT]
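The before/after pair from this post can be tried outside OmegaT with plain `java.util.regex`. The sketch below is only an approximation under stated assumptions: it glues the two patterns together with a lookahead to find break positions, which is a choice made for this example and not necessarily how OmegaT evaluates rules, and the class name `UnicodeQuoteRuleSketch` and method `split` are hypothetical. Two details are worth knowing when reading the results. First, inside the outer square brackets Java treats `(`, `)`, `<`, `>` and `+` as literal members of one character class (with `[\w/]` merged in as a nested class), so a tag is matched character by character rather than as a unit. Second, the straight quote " belongs to Unicode category Po, not Pf, so this rule on its own covers curly closing quotes but not straight ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical test harness; the two patterns are copied from the post.
class UnicodeQuoteRuleSketch {
    static final String BEFORE = "[\\.\\?\\!]+[\\p{Pf}\\p{Pe}(<[\\w/]+>)]*";
    static final String AFTER = "\\s+[\\p{Ps}\\p{Pi}(<[\\w/]+>)]*\\p{Lu}";

    // Assumption: a break sits where BEFORE ends and AFTER begins.
    static final Pattern BREAK =
            Pattern.compile("(?:" + BEFORE + ")(?=" + AFTER + ")");

    static List<String> split(String text) {
        List<String> segments = new ArrayList<>();
        int start = 0;
        Matcher m = BREAK.matcher(text);
        while (m.find()) {
            // Cut right after the BEFORE match; drop surrounding whitespace.
            segments.add(text.substring(start, m.end()).trim());
            start = m.end();
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            segments.add(tail);
        }
        return segments;
    }

    public static void main(String[] args) {
        // Curly quotes are written as Unicode escapes to avoid encoding issues.
        String curly = "He said \u201CStop.\u201D Then he left. (Really?) Yes.";
        String straight = "He said \"Stop.\" Then he left.";
        System.out.println(split(curly));
        System.out.println(split(straight));
    }
}
```

To cover straight quotes as well, one option would be to add \" to both bracketed classes, in the spirit of the \.[\"”] rule suggested earlier in the thread.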
Hi Andrey,

Pardon my late reply, but thank you for pointing this out; very good to know. And thanks also for taking the trouble of constructing an expression to deal with this. I will give it a try.

Best,
Thijs