[99% SOLVED] How to match Cyrillic characters with a regular expression

Debugger
Posts: 565
Joined: Thu Jan 26, 2017 11:56 am

Post by Debugger »

Text Editor:
I want to find Russian sentences that may also contain one or more dots, dashes, digits, commas, etc. It is very important that each match stays within a single line!


[ЁёА-Яа-я„«—]

example:
Что такое стиль. Настольная книга для писательницы
Что такое стиль. Настольная книга для писательницы - 2
Что такое стиль. Настольная — писательницы -
3. Что такое стиль. Настольная книга для писательницы
Last edited by Debugger on Thu Apr 18, 2019 9:15 am, edited 1 time in total.
therube
Posts: 4580
Joined: Thu Sep 03, 2009 6:48 pm

Re: How to match Cyrillic characters with a regular expression

Post by therube »

(Of course I'm not following, but...)


Separate your letters out first.

regex:[ЁёА-Яа-я] (or regex:[ЁёА-я], I think)

> В цепях древней тайны.mp3
> Славься, Русь!.mp3

Then add your punctuation.
Will that work?

regex:[ЁёА-Яа-я] regex:[,„«—]+

> Славься, Русь!.mp3
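
A quick way to check the "I think" above is to compare the two classes code point by code point. This is only a sketch in Python's built-in re module, not the Everything search syntax; the helper names same_matches and has_cyrillic are made up for illustration.

```python
import re

# The two character classes from the post above.
WIDE = re.compile("[ЁёА-Яа-я]")
SHORT = re.compile("[ЁёА-я]")


def same_matches(lo=0x0000, hi=0x052F):
    """Return True if both classes accept exactly the same code points in [lo, hi]."""
    return all(
        bool(WIDE.fullmatch(chr(cp))) == bool(SHORT.fullmatch(chr(cp)))
        for cp in range(lo, hi + 1)
    )


def has_cyrillic(name):
    """Return True if the name contains at least one Russian letter."""
    return WIDE.search(name) is not None


if __name__ == "__main__":
    print(same_matches())
    print(has_cyrillic("Славься, Русь!.mp3"))
```

The two classes agree because А-Я (U+0410 to U+042F) and а-я (U+0430 to U+044F) sit next to each other in Unicode, so А-я covers both ranges. Ё and ё lie outside that run and still have to be listed separately.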
Debugger
Posts: 565
Joined: Thu Jan 26, 2017 11:56 am

Re: How to match Cyrillic characters with a regular expression

Post by Debugger »

That regular expression is WRONG.
It does not restrict the match to the characters on a single line.
It finds virtually the same text that it should not find. It should not match ordinary text; as written, it looks for the characters anywhere in the text instead of being strictly confined to a single line that contains at least some Russian text.

I do not need the regex: operator.


Example:

Line1: Russian text and or not and other char
Line2: Russian text
Line3: Polish text
(Separator)Line4:===
Line5: Russian text and or not and other char
Line6: Russian text
Line7: Polish text
(Separator)Line8:===
Debugger
Posts: 565
Joined: Thu Jan 26, 2017 11:56 am

Re: How to match Cyrillic characters with a regular expression

Post by Debugger »

void - Well, yes, but I cannot find anything on how to write a regular expression that requires a single line to consist strictly of the defined (Russian) characters, with no mixed-in English, Polish, German, or other text.

.+[ЁёА-Яа-я.,„”"«—0-9)(]\n
void
Developer
Posts: 15096
Joined: Fri Oct 16, 2009 11:31 pm

Re: How to match Cyrillic characters with a regular expression

Post by void »

Requires PCRE in multiline mode:

^([\p{Cyrillic}]+[\-\.—0-9]+[\p{Cyrillic}\-\.—0-9]*|[\-\.—0-9]+[\p{Cyrillic}]+[\p{Cyrillic}\-\.—0-9]*)$

This also requires at least one Cyrillic character, which I assume you want; otherwise it would match a long string of numbers, dashes, or dots.

^ = match start of string (or line, in multiline mode)
[] = match character in a set
\p{Cyrillic} = match a Cyrillic character
\- = match a literal -
\. = match a literal .
+ = match previous element one or more times.
* = match previous element zero or more times.
$ = match end of string (or line, in multiline mode)
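
For anyone who wants to try the pattern outside a PCRE-based editor, here is a rough Python sketch. Python's built-in re module does not support \p{Cyrillic}, so the sketch assumes the basic Cyrillic block U+0400 to U+04FF is a good enough stand-in (it is only an approximation of the script property). The name matching_lines is hypothetical, and a non-capturing group replaces the capturing one.

```python
import re

# Assumption: the basic Cyrillic block approximates \p{Cyrillic},
# which Python's built-in re module does not support.
CYR = r"[\u0400-\u04FF]"
PUNCT = r"[\-\.—0-9]"
MIXED = r"[\u0400-\u04FF\-\.—0-9]"

# Same two-branch shape as the pattern above: Cyrillic first, or punctuation first.
PATTERN = re.compile(
    rf"^(?:{CYR}+{PUNCT}+{MIXED}*|{PUNCT}+{CYR}+{MIXED}*)$",
    re.MULTILINE,  # ^ and $ match at the start and end of every line
)


def matching_lines(text):
    """Return every line of text that the pattern matches in full."""
    return [m.group(0) for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "Глава-1.\n3.Глава\nГлава\n12345\nHello-1\nЧто такое стиль."
    for line in matching_lines(sample):
        print(line)
```

Note that the character classes contain no space, so a line such as "Что такое стиль." is rejected, as is a line made up of Cyrillic letters only. A space or \s would have to be added to the classes to match whole sentences.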
Debugger
Posts: 565
Joined: Thu Jan 26, 2017 11:56 am

Re: How to match Cyrillic characters with a regular expression

Post by Debugger »

Unfortunately, I do not use PCRE, but I switched to the Onigmo engine and it will work.

I have modified the regex:
^([\p{Cyrillic}]+[\-\.\,\!\…\?\(\)\„\”\,\;\\\/\*\#\@\&\:\.x{200B}—0-9\s]+[\p{Cyrillic}\-\.\,\!\…\?\(\)\„\”\,\;\\\/\*\#\@\&\:\.x{200B}—0-9\s]*|[\-\.\,\…\?\(\)\„\”\,\;\\\/\*\#\@\&\:\.x{200B}—0-9\s]+[\p{Cyrillic}]+[\p{Cyrillic}\-\.\,\!\…\?\(\)\„\”\,\;\\\/\*\#\@\&\:\.x{200B}—0-9\s]*)$

but the regex is still wrong.

The text may include:
\p{Cyrillic}
!
!!
!!!
!!!!
?
??
???
… (unicode)
— (unicode)
-
.
..
...
,
0-9
(
)
„ (unicode)
” (unicode)
"
\s (space)
\
/
\x{200B} or really maybe .\x{200B}
*
#
@
&
:
;
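
A possible alternative to the long two-branch pattern, sketched in Python rather than Onigmo: check each line on its own, require that every character comes from the allowed list above, and use a lookahead to require at least one Cyrillic letter. This is a different technique from the regex posted earlier, not a corrected version of it. The names ALLOWED, LINE_RE and russian_lines are hypothetical, and the basic Cyrillic block U+0400 to U+04FF is assumed as a stand-in for \p{Cyrillic}, which Python's built-in re module lacks.

```python
import re

# Assumption: the basic Cyrillic block stands in for \p{Cyrillic}.
CYR = r"\u0400-\u04FF"

# Characters from the list above: punctuation, quotes, space, slashes,
# the zero-width space U+200B, and a few symbols.
ALLOWED = "!?…—-.,()„”\" \\/\u200b*#@&:;"

# Lookahead: at least one Cyrillic letter somewhere in the line.
# Main class: only Cyrillic letters, digits, and the allowed characters.
LINE_RE = re.compile(
    "(?=.*[" + CYR + "])[" + CYR + "0-9" + re.escape(ALLOWED) + "]+"
)


def russian_lines(text):
    """Return the lines that contain Cyrillic and nothing outside the allowed set."""
    return [line for line in text.splitlines() if LINE_RE.fullmatch(line)]


if __name__ == "__main__":
    sample = (
        "Что такое стиль. Настольная книга для писательницы - 2\n"
        "Zwykły polski tekst\n"
        "===\n"
        "3. Что такое стиль. Настольная — писательницы -\n"
        "..."
    )
    for line in russian_lines(sample):
        print(line)
```

Testing line by line with fullmatch avoids the multiline pitfall where \s also matches a newline and lets a match run across several lines.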