|
316 | 316 | "cell_type": "markdown", |
317 | 317 | "metadata": {}, |
318 | 318 | "source": [ |
319 | | - "# Composition" |
| 319 | + "# Maintainable expressions" |
320 | 320 | ] |
321 | 321 | }, |
322 | 322 | { |
323 | 323 | "cell_type": "markdown", |
324 | 324 | "metadata": {}, |
325 | 325 | "source": [ |
326 | 326 | "Sophisticated regular expressions tend to be very hard to read. There are a couple of things you can do to mitigate that issue.\n", |
327 | | - "* Use `re.VERBOSE` so that you can add whitespace and comments to the regular expression defintions.\n", |
328 | | - "* Use composition, i.e., define regular expressions that describe part of the match, and compose those t match the entire expression." |
| 327 | + "* Use composition, i.e., define regular expressions that describe part of the match, and compose those to match the entire expression.\n", |
| 328 | + "* Use named captures.\n", |
| 329 | + "* Use `re.VERBOSE` so that you can add whitespace and comments to the regular expression definitions." |
329 | 330 | ] |
330 | 331 | }, |
331 | 332 | { |
|
457 | 458 | "cell_type": "markdown", |
458 | 459 | "metadata": {}, |
459 | 460 | "source": [ |
460 | | - "Although the final regular expression is still rather long, it is easier to read and to maintain. Using `re.VERBOSE` and triple-quoted strings helps to further make the regular expression more maintainable." |
| 461 | + "To avoid a long and tedious argument list, it is more convenient to store the subexpressions in a dictionary." |
461 | 462 | ] |
462 | 463 | }, |
463 | 464 | { |
464 | 465 | "cell_type": "code", |
465 | 466 | "execution_count": 15, |
466 | 467 | "metadata": {}, |
| 468 | + "outputs": [], |
| 469 | + "source": [ |
| 470 | + "regex_parts = {\n", |
| 471 | + " 'date': r'\\d{4}-\\d{2}-\\d{2}',\n", |
| 472 | + " 'time': r'\\d{2}:\\d{2}:\\d{2}\\.\\d+',\n", |
| 473 | + "}" |
| 474 | + ] |
| 475 | + }, |
| 476 | + { |
| 477 | + "cell_type": "markdown", |
| 478 | + "metadata": {}, |
| 479 | + "source": [ |
| 480 | + "This can be further improved by using named capture groups." |
| 481 | + ] |
| 482 | + }, |
| 483 | + { |
| 484 | + "cell_type": "code", |
| 485 | + "execution_count": 16, |
| 486 | + "metadata": {}, |
| 487 | + "outputs": [], |
| 488 | + "source": [ |
| 489 | + "regex_parts['datetime'] = r'(?P<datetime>{date}\\s+{time})'.format(**regex_parts)" |
| 490 | + ] |
| 491 | + }, |
| 492 | + { |
| 493 | + "cell_type": "markdown", |
| 494 | + "metadata": {}, |
| 495 | + "source": [ |
| 496 | + "Now the match can be retrieved by name rather than by index, which makes the code less error-prone and more robust to change." |
| 497 | + ] |
| 498 | + }, |
| 499 | + { |
| 500 | + "cell_type": "code", |
| 501 | + "execution_count": 17, |
| 502 | + "metadata": {}, |
| 503 | + "outputs": [ |
| 504 | + { |
| 505 | + "data": { |
| 506 | + "text/plain": [ |
| 507 | + "'2021-08-25 17:04:23.439405'" |
| 508 | + ] |
| 509 | + }, |
| 510 | + "execution_count": 17, |
| 511 | + "metadata": {}, |
| 512 | + "output_type": "execute_result" |
| 513 | + } |
| 514 | + ], |
| 515 | + "source": [ |
| 516 | + "match = re.match(regex_parts['datetime'], log_entry)\n", |
| 517 | + "match.group('datetime')" |
| 518 | + ] |
| 519 | + }, |
| 520 | + { |
| 521 | + "cell_type": "code", |
| 522 | + "execution_count": 18, |
| 523 | + "metadata": {}, |
| 524 | + "outputs": [], |
| 525 | + "source": [ |
| 526 | + "regex_parts['log_level'] = r'\\[(?P<log_level>\\w+)\\]'\n", |
| 527 | + "regex_parts['log_msg'] = r'end\\s+process\\s+(?P<process_id>\\d+)\\s+exited\\s+with\\s+(?P<exit_status>\\d+)'" |
| 528 | + ] |
| 529 | + }, |
| 530 | + { |
| 531 | + "cell_type": "markdown", |
| 532 | + "metadata": {}, |
| 533 | + "source": [ |
| 534 | + "Although the final regular expression is still rather long, it is easier to read and to maintain. Using `re.VERBOSE` and triple-quoted strings makes the regular expression even more maintainable." |
| 535 | + ] |
| 536 | + }, |
| 537 | + { |
| 538 | + "cell_type": "code", |
| 539 | + "execution_count": 19, |
| 540 | + "metadata": {}, |
467 | 541 | "outputs": [ |
468 | 542 | { |
469 | 543 | "name": "stdout", |
|
478 | 552 | ], |
479 | 553 | "source": [ |
480 | 554 | "regex = re.compile(r'''\n", |
481 | | - " ({date}\\s+{time})\\s+ # date-time, up to microsecond presision\n", |
482 | | - " {level}\\s*:\\s* # log level of the log message\n", |
483 | | - " {msg} # actual log message\n", |
484 | | - " '''.format(date=date, time=time, level=level, msg=msg), re.VERBOSE)\n", |
| 555 | + "    {datetime}\\s+           # date-time, up to microsecond precision\n", |
| 556 | + " {log_level}\\s*:\\s* # log level of the log message\n", |
| 557 | + " {log_msg} # actual log message\n", |
| 558 | + " '''.format(**regex_parts), re.VERBOSE)\n", |
485 | 559 | "match = regex.match(log_entry)\n", |
486 | | - "print(f'datetime = {match.group(1)}')\n", |
487 | | - "print(f'log level: {match.group(2)}')\n", |
488 | | - "print(f'process = {match.group(3)}')\n", |
489 | | - "print(f'exit status = {match.group(4)}')" |
| 560 | + "print(f\"datetime = {match.group('datetime')}\")\n", |
| 561 | + "print(f\"log level: {match.group('log_level')}\")\n", |
| 562 | + "print(f\"process = {match.group('process_id')}\")\n", |
| 563 | + "print(f\"exit status = {match.group('exit_status')}\")" |
| 564 | + ] |
| 565 | + }, |
| 566 | + { |
| 567 | + "cell_type": "markdown", |
| 568 | + "metadata": {}, |
| 569 | + "source": [ |
| 570 | + "**Note:** up to Python 3.11, f-string expressions could not contain backslashes (a restriction lifted in Python 3.12), hence the use of the `format` method for string substitution." |
490 | 571 | ] |
491 | 572 | } |
492 | 573 | ], |
|