Python String Parsing for Beginners and Experts

Python provides a variety of ways to work with strings. In this blog post, we will first introduce the basics of string parsing in Python and clarify the difference between string indexing and string sequencing.

String indexing

In Python, a string is a sequence of characters. Each character can be accessed via an index, which starts at 0. This way you can access the first, second and third character of a string:

text = "Python"
print(text[0]) # Prints 'P'.
print(text[1]) # Prints 'y'.
print(text[2]) # Outputs 't'.

You can also use negative indexes to access characters from the end of the string:

print(text[-1]) # Outputs 'n'.
print(text[-2]) # Prints 'o'.
print(text[-3]) # Prints 'h'.
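A negative index can be read as an offset from the length of the string: text[-k] addresses the same character as text[len(text) - k]. The short sketch below checks this for every valid position. The helper name from_end is only illustrative and not part of Python.

```python
text = "Python"

def from_end(s, k):
    """Return the k-th character counted from the end (k=1 is the last one)."""
    return s[len(s) - k]

for k in range(1, len(text) + 1):
    # Both expressions address the same character.
    assert text[-k] == from_end(text, k)

print(text[-1], from_end(text, 1))  # n n
```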

String sequence

A string sequence, usually called a slice or substring, is a part of a string specified by a start and end index; the character at the end index itself is not included. In Python, you can use the so-called slicing syntax to access parts of a string:

text = "Python"
print(text[0:3]) # Prints 'Pyt'.
print(text[1:4]) # Prints 'yth'.
print(text[:2]) # Prints 'Py' (without a start index the slice begins at index 0)
print(text[3:]) # Prints 'hon' (without an end index the slice runs to the end of the string)

The slicing syntax also works with negative indices:

print(text[-3:-1]) # Prints 'ho'.
print(text[:-2]) # Prints 'Pyth'.

To summarize the differences between string indexing and string sequence: String indexing refers to accessing a single character in a string, while string sequence provides access to a portion of the string.
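One practical consequence of this difference shows up at the edges of a string. As a small sketch: an index outside the string raises an IndexError, while a slice outside the string is silently shortened. The variable names are chosen only for illustration.

```python
text = "Python"

# Indexing returns exactly one character (a string of length 1).
single = text[2]

# Slicing returns a new string that may be empty or longer.
part = text[2:5]

# An index outside the string raises IndexError ...
try:
    text[10]
    index_failed = False
except IndexError:
    index_failed = True

# ... while a slice outside the string is clamped silently.
clamped = text[3:10]
empty = text[10:20]

print(single, part, index_failed, repr(clamped), repr(empty))
# t tho True 'hon' ''
```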

A Use Case for the Split Method in Python: Version Numbers and Commit Hashes

In many situations, such as working with software projects, we need to extract information from combined strings. As an example, let's take a string that contains a version number and a commit hash: "2.7.0_bf4fda703454". To separate this information, we can use the split() method in Python.

version_hash = "2.7.0_bf4fda703454"
version_hash_list = version_hash.split("_")

The result is a list of strings that looks like this:

['2.7.0', 'bf4fda703454']

Here the string is separated at each underscore. If you want the separation to occur only at the first underscore, you can call split() with the optional second argument maxsplit set to 1:

version_hash.split("_", 1)
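With the sample string the two calls return the same list, because it contains only one underscore. The difference becomes visible with a second underscore. The following sketch assumes a made-up "_dirty" suffix purely for illustration:

```python
# Hypothetical input with a second underscore in the suffix.
version_hash = "2.7.0_bf4fda703454_dirty"

all_parts = version_hash.split("_")       # split at every underscore
first_only = version_hash.split("_", 1)   # split at the first underscore only

print(all_parts)   # ['2.7.0', 'bf4fda703454', 'dirty']
print(first_only)  # ['2.7.0', 'bf4fda703454_dirty']
```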

If you are sure that the string contains an underscore, you can even unpack the left and right parts into separate variables:

lhs, rhs = version_hash.split("_", 1)
print(lhs) # Prints '2.7.0'.
print(rhs) # Prints 'bf4fda703454'.

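An alternative to the split() method is partition(). The usage is similar to the last example, except that three components are returned instead of two: the part before the separator, the separator itself, and the part after it. The main advantage of this method is that unpacking the result does not fail if the string does not contain the separator; in that case the second and third components are empty strings.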
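A minimal sketch of this behavior, assuming a hypothetical helper split_version that returns None for a missing commit hash:

```python
def split_version(value):
    """Split 'version_hash' into (version, commit_hash); hash is None if absent."""
    version, sep, commit_hash = value.partition("_")
    # sep is '' when no underscore was found; partition() still returns 3 items.
    return version, (commit_hash if sep else None)

print(split_version("2.7.0_bf4fda703454"))  # ('2.7.0', 'bf4fda703454')
print(split_version("2.7.0"))               # ('2.7.0', None)

# By contrast, unpacking the result of split() fails without a separator.
try:
    lhs, rhs = "2.7.0".split("_", 1)
except ValueError as error:
    print("split() unpacking failed:", error)
```

Checking the returned separator is one way to tell "no hash present" apart from "empty hash"; other conventions are equally possible.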

In general, the split() method provides a convenient way to split and process combined information in Python. In our example, we could easily separate the version number and the commit hash.

String parsing with regular expressions - Regex

This section decomposes text with named regex groups in Python and weighs the advantages and disadvantages against traditional approaches such as string indexing, string sequences, or the split() method.

When we have complex strings that need to be decomposed due to patterns or difficult punctuation, regular expressions can help recognize and decompose these patterns.

Here is a code example of string parsing with named regex groups in Python:

import re
# Example text
text = "Max Mustermann, Age: 30, City: Berlin"
# Regular expression with named groups (?P<group_name>...)
pattern_text = r'(?P<name>[A-Za-zäöüÄÖÜß\s]+),\s+Age:\s+(?P<age>\d+),\s+City:\s+(?P<city>[A-Za-zäöüÄÖÜß\s]+)'
# Compile the regular expression
pattern = re.compile(pattern_text)
# Search for a match at the beginning of the text
match = pattern.match(text)
# Check if a match was found
if match:
    # Extract the named groups
    name = match.group('name')
    age = match.group('age')
    city = match.group('city')
    # Output the extracted information
    print(f"Name: {name}")
    print(f"Age: {age}")
    print(f"City: {city}")
else:
    print("No match found.")

In this example, we have a sample text with a person's name, age, and city. We use a regular expression with named groups to extract the relevant information from the text. The groups are named as follows:

  • name: for the name of the person
  • age: for the age of the person
  • city: for the city where the person lives

After the regular expression is compiled, we look for matches in the text. If a match is found, we extract the named groups and output the extracted information.
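Instead of reading each group individually, the match object can also return all named groups at once via groupdict(). The sketch below uses the same sample text but a slightly looser pattern, which is an assumption made here for brevity and not the pattern from the article:

```python
import re

# Looser, illustrative pattern: any text up to the comma is the name.
pattern = re.compile(
    r"(?P<name>[^,]+),\s+Age:\s+(?P<age>\d+),\s+City:\s+(?P<city>.+)"
)

match = pattern.match("Max Mustermann, Age: 30, City: Berlin")
record = match.groupdict() if match else {}

# Captured groups are always strings; convert the age explicitly.
if record:
    record["age"] = int(record["age"])

print(record)  # {'name': 'Max Mustermann', 'age': 30, 'city': 'Berlin'}
```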

Advantages:

  1. Flexibility: Regular expressions offer a high degree of flexibility in text analysis and can recognize complex patterns.
  2. Compact: Regular expressions are often shorter and more concise than traditional methods.
  3. Named groups: With named groups we can easily identify and extract parts of the pattern.

Disadvantages:

  1. Learning curve: Regular expressions have a steeper learning curve and are often harder to read and understand than traditional methods.
  2. Limited applicability: Regular expressions are not suitable for deeply nested documents like HTML, XML or JSON.

Python string indexing, sequencing or split method: These traditional functions are well suited for simple text analysis tasks where the structure of the string is clear and simple.

Advantages:

  1. Simplicity: Traditional methods are easy to understand and implement.
  2. Readability: The code is easier to read and understand for developers who are familiar with Python.

Disadvantages:

  1. Limited flexibility: These methods offer less flexibility in analyzing complex text patterns.
  2. Longer code: Code can become longer and more confusing when trying to identify and decompose complex patterns.
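For comparison, the person record from the regex example can also be parsed with split() and partition() alone. This is only a sketch: it relies on the assumption that no value contains ", " or ": ", which is exactly the kind of limitation listed above.

```python
text = "Max Mustermann, Age: 30, City: Berlin"

# Split into fields at ", ", then split each labelled field at ": ".
name, *fields = text.split(", ")
record = {"name": name}
for field in fields:
    key, _, value = field.partition(": ")
    record[key.lower()] = value

# Values are strings; convert the age explicitly.
record["age"] = int(record["age"])
print(record)  # {'name': 'Max Mustermann', 'age': 30, 'city': 'Berlin'}
```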

Automate Python Regex

In this example, we show how to use the Konfuzio SDK to automate the creation of regular expressions. We use the SDK to create a custom regex tokenizer for the label "ContractDate". More information is available in the technical documentation and in further blog articles.

from konfuzio_sdk.data import Project
from konfuzio_sdk.tokenizer.regex import RegexTokenizer
from konfuzio_sdk.tokenizer.base import ListTokenizer
# Replace YOUR_PROJECT_ID and YOUR_CATEGORY_ID with the corresponding IDs of your project and category.
my_project = Project(id_=YOUR_PROJECT_ID)
category = my_project.get_category_by_id(id_=YOUR_CATEGORY_ID)
tokenizer = ListTokenizer(tokenizers=[])
label = my_project.get_label_by_name("ContractDate")
# Find regular expressions that match occurrences of the label "ContractDate"
for regex in label.find_regex(category=category):
    regex_tokenizer = RegexTokenizer(regex=regex)
    tokenizer.tokenizers.append(regex_tokenizer)
# Use the created regex tokenizer to create annotations for each matching string in a document.
# Replace YOUR_DOCUMENT_ID with the corresponding ID of your document.
document = my_project.get_document_by_id(YOUR_DOCUMENT_ID)
tokenizer.tokenize(document)

In this code snippet:

  1. We import the required classes from the Konfuzio SDK.
  2. We create a Project object by specifying the project ID.
  3. We get a Category object by specifying the category ID.
  4. We create an empty ListTokenizer.
  5. We get the Label object for "ContractDate".
  6. We look for regular expressions that match occurrences of the label "ContractDate" in the training data of the category.
  7. We add each regular expression found to the ListTokenizer as a RegexTokenizer.
  8. We use the resulting tokenizer to create annotations for each matching string in a document. To do this, we specify the document ID and call the tokenize() method of the tokenizer.

After the tokenizer is created, you can use it to automatically find occurrences of "ContractDate" in your documents and create annotations for them.
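Conceptually, a list of regex tokenizers applies several patterns to the same text and collects every match. The sketch below illustrates that idea with the standard re module only. It is not the Konfuzio SDK implementation; the two date patterns, the sample sentence, and the function name find_spans are assumptions made for illustration.

```python
import re

# Hypothetical patterns; in the SDK example they come from label.find_regex().
date_patterns = [
    r"\d{2}\.\d{2}\.\d{4}",  # e.g. 01.02.2023
    r"\d{4}-\d{2}-\d{2}",    # e.g. 2023-02-01
]

def find_spans(text, patterns):
    """Return sorted (start, end, matched_text) tuples for all patterns."""
    spans = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            spans.append((match.start(), match.end(), match.group()))
    return sorted(spans)

sample = "Signed on 01.02.2023, effective from 2023-03-01."
for start, end, value in find_spans(sample, date_patterns):
    print(start, end, value)
```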

Florian Zyprian