Processor Examples

This page presents a few example processors to simplify the creation of your first processor subclass.

Simple and looping processors

The first two examples in this library demonstrate how to implement a simple processor that executes all calculations in one go, and a looping processor that loops over a list of items to obtain a cumulative result.

AccumulatorProcessor calculates the sum of the first N integers in a loop. The processor takes last_value as input and stores the output in the accumulated_value attribute. This approach is very inefficient, but it demonstrates how to subclass a looping processor.

class AccumulatorProcessor(Processor):
    r"""
    A processor to calculate the sum of the first n values via a looping approach.

    In mathematical terms, this processor solves this easy equation:

    .. math::

        N = \sum_{i=0}^{n-1}{i}

    by looping. It is a terribly inefficient approach, but it works as a demonstration of the looping structure.

    The user can get the results by retrieving the `accumulated_value` parameter at the end of the processor
    execution.
    """

    last_value = ActiveParameter('last_value', default=100, help_doc='Last value of the series')

    def __init__(self, *args, **kwargs):
        """Constructor parameters:

        :param last_value: The `n` in the equation above. Defaults to 100
        :type last_value: int
        :param accumulated_value: The `N` in the equation above at the end of the process.
        :type accumulated_value: int
        """
        super().__init__(*args, **kwargs)
        self.accumulated_value: int = 0

    def start(self):
        """Resets the accumulated value to 0 before starting."""
        super().start()
        self.accumulated_value = 0

    def get_items(self) -> list[int]:
        """Returns the list of the first `last_value` integers."""
        return list(range(self.last_value))

    def process(self):
        """Increase the accumulated value by the current item."""
        self.accumulated_value += self.item

GaussAdder calculates exactly the same result using the Gauss formula, eliminating the need for any looping. The looping is disabled, and the output is the same.

class GaussAdder(Processor):
    r"""
    A processor to calculate the sum of the first n values via the so called *Gauss formula*.

    In mathematical terms, this processor solves this easy equation:

    .. math::

        N = \frac{n * (n - 1)}{2}

    without any looping

    The user can get the results by retrieving the `sum_value` parameter at the end of the processor
    execution.
    """

    last_value = ActiveParameter('last_value', default=100, help_doc='Last value of the series.')

    def __init__(self, *args, **kwargs):
        """
        Constructor parameters:

        :param last_value: The `n` in the equation above. Defaults to 100
        :type last_value: int
        :param sum_value: The `N` in the equation above.
        :type sum_value: int
        """
        super().__init__(looper=LoopType.SingleLoop, *args, **kwargs)
        self.sum_value: int = 0

    def start(self):
        """Sets the sum value to 0."""
        super().start()
        self.sum_value = 0

    def process(self):
        """Compute the sum using the Gauss formula."""
        self.sum_value = int(self.last_value * (self.last_value - 1) / 2)

If you look carefully at the super().__init__() call in the GaussAdder constructor, you will notice that the looper option is set to LoopType.SingleLoop. As we have seen, this means that the processor will follow the single loop execution workflow.

Setting the looper parameter in the init method can be hard to remember and impractical, especially if you have to overload the init method just to set the value of the looper. In such circumstances, a class decorator can be very handy. MAFw provides three class decorators for this purpose, to turn a processor into a single loop, a for loop or a while loop processor.
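To illustrate the idea behind such a decorator, here is a minimal sketch written without MAFw. All names (DemoLoopType, DemoProcessor, demo_single_loop) are hypothetical stand-ins, and the way the real MAFw decorators are implemented may well differ; the sketch only shows how a class decorator can inject a constructor option so that the subclass does not need its own init method.

```python
import functools
from enum import Enum


class DemoLoopType(Enum):
    """Hypothetical stand-in for a loop type enumeration."""

    SingleLoop = 'single'
    ForLoop = 'for'


class DemoProcessor:
    """Hypothetical minimal base class accepting a looper option."""

    def __init__(self, looper=DemoLoopType.ForLoop):
        self.looper = looper


def demo_single_loop(cls):
    """Class decorator forcing looper=SingleLoop at construction time."""
    original_init = cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        # Override whatever looper the caller may have passed by keyword.
        kwargs['looper'] = DemoLoopType.SingleLoop
        original_init(self, *args, **kwargs)

    cls.__init__ = patched_init
    return cls


@demo_single_loop
class DemoAdder(DemoProcessor):
    pass


print(DemoAdder().looper)      # DemoLoopType.SingleLoop
print(DemoProcessor().looper)  # DemoLoopType.ForLoop
```

Only the decorated class is affected: the base class keeps its default loop type.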

Using the decorator approach the GaussAdder above can be re-written in this way:

@single_loop
class GaussAdder(Processor):
    # the rest of the implementation remains the same
    ...

And here below is an example of execution of the two.

from mafw.examples.sum_processor import GaussAdder, AccumulatorProcessor

n = 35

# create the two processors
accumulator = AccumulatorProcessor(last_value=n)
gauss = GaussAdder(last_value=n)

# execute them
accumulator.execute()
gauss.execute()

# print the calculated results
print(accumulator.accumulated_value)
print(gauss.sum_value)

This will generate the following output:

595
595
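The same result can be cross-checked without MAFw. The short sketch below repeats the two strategies in plain Python, assuming, as in the processors above, that the series runs from 0 to n - 1.

```python
n = 35

# Looping strategy: accumulate the integers 0, 1, ..., n - 1 one by one.
loop_sum = 0
for i in range(n):
    loop_sum += i

# Closed-form strategy: the Gauss formula for the same series.
gauss_sum = n * (n - 1) // 2

print(loop_sum, gauss_sum)  # 595 595
```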

Modify the for loop cycle using the `LoopingStatus`

In a looping processor, the process() method is invoked inside a loop, but the user can decide to skip a certain item and even to interrupt (abort or quit) the loop.

The tool to achieve this is the LoopingStatus. This is set to Continue at the beginning of each iteration, but the user can set it to Skip, Abort or Quit inside the implementation of process().

When it is set to Skip, a special callback, skip_item(), is invoked, where the user can take action accordingly. When it is set to Abort or Quit, the loop is broken and the user can decide what to do in the finish() method. Those two statuses seem redundant, but they give the user the freedom to decide whether everything was wasted (Abort) or whether what was done so far is still acceptable (Quit).
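The control flow described above can be sketched in plain Python. This is not MAFw's implementation: DemoStatus and run_loop are hypothetical names, and for brevity the status is returned by the process callable instead of being stored in an attribute.

```python
from enum import Enum, auto


class DemoStatus(Enum):
    """Hypothetical stand-in for the looping status."""

    Continue = auto()
    Skip = auto()
    Abort = auto()
    Quit = auto()


def run_loop(items, process, skip_item):
    """Call process on each item and react to the status it returns."""
    final_status = DemoStatus.Continue
    for item in items:
        status = process(item)
        if status is DemoStatus.Skip:
            skip_item(item)  # give the user a chance to react to the skip
            continue
        if status in (DemoStatus.Abort, DemoStatus.Quit):
            final_status = status  # remember why the loop was interrupted
            break
    return final_status


skipped = []


def process(item):
    if item in (2, 4):
        return DemoStatus.Skip
    if item == 6:
        return DemoStatus.Abort
    return DemoStatus.Continue


final = run_loop(range(10), process, skipped.append)
print(final, skipped)  # DemoStatus.Abort [2, 4]
```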

Here below is the implementation of a simple processor demonstrating such a functionality.

class ModifyLoopProcessor(Processor):
    """
    Example processor demonstrating how it is possible to change the looping structure.

    It is a looping processor where some events will be skipped, and at some point one event will trigger an abort.
    """

    total_item: ActiveParameter[int] = ActiveParameter('total_item', default=100, help_doc='Total item in the loop.')
    items_to_skip: ActiveParameter[list[int]] = ActiveParameter(
        'items_to_skip', default=[12, 16, 25], help_doc='List of items to be skipped.'
    )
    item_to_abort: ActiveParameter[int] = ActiveParameter('item_to_abort', default=65, help_doc='Item to abort')

    def __init__(self, *args, **kwargs):
        """
        Processor Parameters:

        :param total_item: The total number of items
        :type total_item: int
        :param items_to_skip: A list of items to skip.
        :type items_to_skip: list[int]
        :param item_to_abort: The item where to trigger an abort.
        :type item_to_abort: int

        """
        super().__init__(*args, **kwargs)
        self.skipped_items: list[int] = []
        """A list with the skipped items."""

    def start(self):
        """Resets the skipped item container."""
        super().start()
        self.skipped_items = []

    def get_items(self) -> list[int]:
        """Returns the list of items, the range from 0 to total_item."""
        return list(range(self.total_item))

    def process(self):
        """Processes the item"""
        if self.item in self.items_to_skip:
            self.looping_status = LoopingStatus.Skip
            return
        if self.item == self.item_to_abort:
            self.looping_status = LoopingStatus.Abort
            return

    def skip_item(self):
        """Add skipped item to the skipped item list."""
        self.skipped_items.append(self.item)

And here below is how the processor can be used.

import random
from mafw.examples.loop_modifier import ModifyLoopProcessor

# generate a random number corresponding to the last item
last_value = random.randint(10, 1000)

# get a sample with event to be skipped
skip_items = random.sample(range(last_value), k=4)

# find an event to abort after the last skipped one
max_skip = max(skip_items)
if max_skip + 1 < last_value:
    abort_item = max_skip + 1
else:
    abort_item = last_value - 1

# create the processor and execute it
mlp = ModifyLoopProcessor(total_item=last_value, items_to_skip=skip_items, item_to_abort=abort_item)
mlp.execute()

# compare the recorded skipped items with the list we provided.
assert mlp.skipped_items == list(sorted(skip_items))

# check that the last item was the abort item.
assert mlp.item == abort_item

For and while loop execution workflow

We have seen in the previous chapter that there are different types of loopers, and in the previous sections we have seen in practice the execution workflow of a single loop and a for loop processor.

In this example, we will explore the difference between the for loop and the while loop execution workflow. Both processors will run the Processor.process() method inside a loop, but for the former we will loop over a pre-established list of items, while for the latter we will keep repeating the process as long as a certain condition holds.

Both processors will work with prime numbers, and we will use this helper function to check whether an integer is prime or not.

import math


def is_prime(n: int) -> bool:
    """
    Check if n is a prime number.

    :param n: The integer number to be checked.
    :type n: int
    :return: True if n is a prime number. False, otherwise.
    :rtype: bool
    """
    prime = True
    if n < 2:
        prime = False
    elif n == 2:
        prime = True
    elif n % 2 == 0:
        prime = False
    else:
        sqrt_n = int(math.floor(math.sqrt(n)))
        for i in range(3, sqrt_n + 1, 2):
            if n % i == 0:
                prime = False
                # one divisor is enough, no need to test the others
                break

    return prime

The task of the for loop processor is to find all prime numbers included in a given user defined range of integer numbers. In other words, we want to find all prime numbers between 1000 and 2000, for example. The brute force approach is to start a loop on 1000, check if it is prime and if not check the next one until you get to 2000. If a number is actually prime, then store it in a list for further use.
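Before looking at the processor, the brute force approach can be sketched in a few lines of plain Python. The helper names is_prime_sketch and primes_in_range are hypothetical, and the sketch assumes that both range extremes are included in the search.

```python
import math


def is_prime_sketch(n: int) -> bool:
    """Trial division by odd numbers up to the square root of n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for i in range(3, math.isqrt(n) + 1, 2):
        if n % i == 0:
            return False
    return True


def primes_in_range(start: int, stop: int) -> list[int]:
    """Check every integer between start and stop (both included)."""
    return [n for n in range(start, stop + 1) if is_prime_sketch(n)]


found = primes_in_range(1000, 2000)
print(len(found), found[0], found[-1])  # 135 1009 1999
```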

For the sake of clarity, along with the API documentation, we are copying here also the processor source code.

@for_loop
class FindPrimeNumberInRange(Processor):
    """
    An example processor to find prime numbers in the defined interval from ``start_from`` to ``stop_at``.

    This processor is meant to demonstrate the use of a for_loop execution workflow.

    Let us say we want to select only the prime numbers in a user defined range. One possible brute force approach is
    to generate the list of integers between the range extremes and check if it is prime or not. If yes,
    then add it to the list of prime numbers, if not continue with the next element.

    This is a perfect application for a loop execution workflow.
    """

    start_from = ActiveParameter('start_from', default=50, help_doc='From which number to start the search')
    stop_at = ActiveParameter('stop_at', default=100, help_doc='At which number to stop the search')

    def __init__(self, *args: Any, **kwargs: Any):
        """
        Processor parameters:

        :param start_from: First element of the range under investigation.
        :type start_from: int
        :param stop_at: Last element of the range under investigation.
        :type stop_at: int
        """
        super().__init__(*args, **kwargs)
        self.prime_num_found: list[int] = []
        """The list with the found prime numbers"""

This is the class definition with its constructor. As you can see, we have decorated the class with the for loop decorator even though it is not strictly required because the for loop is the default execution workflow.

We have added two processor parameters, start_from and stop_at, to allow the user to specify a range of interest in which to look for prime numbers.

In the init method, we create a list of integers to store all the prime numbers that we will find during the process.

Now let us overload all compulsory methods for a for loop processor.

def get_items(self) -> Collection[Any]:
    """
    Overload of the get_items method.

    This method must be overloaded when you select a for loop workflow.

    Here we generate the list of odd numbers between the start and stop that we need to check.
    We also check that the stop is actually larger than the start, otherwise we print an error message, and we
    return an empty list of items.

    :return: A list of odd integer numbers between start_from and stop_at.
    :rtype: list[int]
    """
    if self.start_from >= self.stop_at:
        log.critical('%s must be smaller than %s' % (self.start_from, self.stop_at))
        return []

    if self.start_from != 2 and self.start_from % 2 == 0:
        self.start_from += 1

    if self.stop_at != 2 and self.stop_at % 2 == 0:
        self.stop_at -= 1

    return list(range(self.start_from, self.stop_at, 2))

The get_items() method is expected to return a list of items that will be processed by the Processor.process() method. Overloading this method is compulsory; otherwise the loop structure will not have a list to loop over.

And now, let us have a look at the three stages: start, process and finish.

def start(self) -> None:
    """
    Overload of the start method.

    **Remember:** to call the super method when you overload the start.

    In this specific case, we just make sure that the list of found prime numbers is empty.
    """
    super().start()
    self.prime_num_found = []

def process(self) -> None:
    """
    The process method.

    In this case, it is very simple. We check if :attr:`.Processor.item` is a prime number. If so, we add it to the list,
    otherwise we let the loop continue.
    """
    if is_prime(self.item):
        self.prime_num_found.append(self.item)

def finish(self) -> None:
    """
    Overload of the finish method.

    **Remember:** to call the super method when you overload the finish method.

    In this case, we just print out some information about the prime number found in the range.
    """
    super().finish()
    log.info(
        'Found %s prime numbers in the range from %s to %s'
        % (len(self.prime_num_found), self.start_from, self.stop_at)
    )
    if len(self.prime_num_found):
        log.info('The smallest is %s', self.prime_num_found[0])
        log.info('The largest is %s', self.prime_num_found[-1])

These three methods are the core of the execution workflow, so it is obvious that you have to overload them. Keep in mind to always include a call to the super method when you overload the start and finish because they perform some tasks also in the basic processor implementation. The code is written in a straightforward manner and includes clear, thorough explanations in the docstring.

The looping parameters Processor.i_item, Processor.n_item and Processor.item can be used while implementing process() and finish(). The n_item is calculated soon after the list of items is returned, while item and i_item are assigned in the for loop as the current item and its enumeration.

Optionally, one can overload the format_progress_message() in order to generate a nice progress message informing the user that something is happening. This is an example:

def format_progress_message(self) -> None:
    self.progress_message = (
        f'Checking integer number: {self.item}, already found {len(self.prime_num_found)} prime numbers'
    )

The task for the while loop processor is again about finding prime numbers, but in a different way. We want to find a certain number of prime numbers starting from an initial value. We cannot generate a list of integers and loop over it as in FindPrimeNumberInRange; instead, we need to reorganize our workflow in order to loop until the number of found primes is equal to the requested one.

This is how such a task can be implemented using the while loop execution framework. You can find the example in the API documentation and an explanation of it here below.

Let us start again from the class definition.

@while_loop
class FindNPrimeNumber(Processor):
    """
    An example of Processor to search for N prime numbers starting from a given starting integer.

    This processor is meant to demonstrate the use of a while_loop execution workflow.

    Let us say we need to find 1000 prime numbers starting from 12347. One possible brute force approach to solve this
    problem is to start checking if the initial value is a prime number. If this is not the case, then check the next
    odd number. If it is the case, then add the current number to the list of found prime numbers and continue until
    the size of this list is 1000.

    This is a perfect application for a while loop execution workflow.
    """

    prime_num_to_find = ActiveParameter(
        'prime_num_to_find', default=100, help_doc='How many prime number we have to find'
    )
    start_from = ActiveParameter('start_from', default=50, help_doc='From which number to start the search')

    def __init__(self, *args: Any, **kwargs: Any):
        """
        Processor parameters:

        :param prime_num_to_find: The number of prime numbers to be found.
        :type prime_num_to_find: int
        :param start_from: The initial integer number from where to start the search.
        :type start_from: int
        """
        super().__init__(*args, **kwargs)
        self.prime_num_found: list[int] = []
        """The list with the found prime numbers"""

The first difference compared to the previous case is the use of the while_loop() decorator. This time it is really necessary to specify the processor LoopType because the while loop is not the default strategy.

The processor has two parameters: the number of prime numbers to find and the value from which to start. As before, in the init method, we define a list of integers to store all the prime numbers that we have found.

For a while loop processor, we do not have a list of items, but we need a condition to decide whether to continue or to stop the loop. For this reason we need to overload the while_condition() method, keeping in mind that we return True if we want the cycle to continue for another iteration and False otherwise.

Here is the implementation of the while_condition() for the FindNPrimeNumber.

def while_condition(self) -> bool:
    """
    Define the while condition.

    First, it checks if the prime_num_to_find is positive. Otherwise, it does not make sense to start.
    Then it will check if the length of the list with the already found prime numbers is enough. If so, then we can
    stop the loop return False, otherwise, it will return True and continue the loop.

    Differently from the for_loop execution, we are responsible to assign the value to the looping variables
    :attr:`.Processor.i_item`, :attr:`.Processor.item` and :attr:`.Processor.n_item`.

    In this case, we will use the :attr:`.Processor.i_item` to count how many prime numbers we have found and :attr:`.Processor.n_item`
    will be our target. In this way, the progress bar will work as expected.

    In the while condition, we set the :attr:`.Processor.i_item` to the current length of the found prime number list.

    :return: True if the loop has to continue, False otherwise
    """
    if self.prime_num_to_find <= 0:
        log.warning('You requested to find a negative number of prime numbers. It makes no sense.')
        return False

    self.i_item = len(self.prime_num_found)
    return self.i_item < self.prime_num_to_find

For a while loop, it is not easy to define an enumeration parameter, and the total number of items might also be misleading. It is left to the user to decide whether to use them or not. If they do, defining and incrementing them is their responsibility. For this processor, it was natural to consider the requested number of primes as the n_item; consequently, the value of i_item can be used to keep track of the number of prime numbers already found. This choice is very convenient because the progress bar, which uses i_item and n_item to calculate the progress, will show the actual progress.

If you do not have any way to assign a value to n_item, do not assign it, or set it to None. In this way, the progress bar will display an indeterminate progress. You can set the value of n_item either in start() or in while_condition(); the first option is preferable in terms of performance because start() is executed only once before the start of the loop.

Here below is the implementation of the three stages.

def start(self) -> None:
    """
    The overload of the start method.

    **Remember:** The start method is called just before the while loop is started. So all instructions in this
    method will be executed only once at the beginning of the process execution. Always put a call to its `super`
    when you overload start.

    First, we empty the list of found prime numbers. It should not be necessary, but it makes the code more readable.
    Then set the :attr:`.Processor.n_item` to the total number of prime numbers we need to find. In this way, the progress bar
    will display useful progress.

    If the start value is smaller than 2, then let's add 2 to the list of found prime number and set our first
    item to check at 3. In principle, we could already add 3 as well, but maybe the user wanted to find only 1
    prime number, and we are returning a list with two, that is not what he was expecting.

    Since prime numbers different from 2 can only be odd, if the starting number is even, increment it already by
    1 unit.
    """
    super().start()
    self.prime_num_found = []
    self.n_item = self.prime_num_to_find
    if self.start_from < 2:
        self.prime_num_found.append(2)
        self.start_from = 3

    if self.start_from % 2 == 0:
        self.item = self.start_from + 1
    else:
        self.item = self.start_from

def process(self) -> None:
    """
    The overload of the process method.

    **Remember:** The process method is called inside the while loop. It has access to the looping parameters:
    :attr:`.Processor.i_item`, :attr:`.Processor.item` and :attr:`.Processor.n_item`.

    In our specific case, the process contains another while loop. We start by checking if the current
    :attr:`.Processor.item` is a prime number or not. If so, then we have found the next prime number, we add it to the list,
    we increment by two units the value of :attr:`.Processor.item` and we leave the process method ready for the next iteration.

    If :attr:`.Processor.item` is not prime, then increment it by 2 and check it again.
    """
    while not is_prime(self.item):
        self.item += 2
    self.prime_num_found.append(self.item)
    self.item += 2

def finish(self) -> None:
    """
    Overload of the finish method.

    **Remember:** The finish method is called only once just after the last loop iteration.
    Always put a call to its `super` when you overload finish.

    The loop is over, it means that the while condition was returning false, and now we can do something with our
    list of prime numbers.
    """
    super().finish()
    log.info('Found the requested %s prime numbers' % len(self.prime_num_found))
    log.info('The smallest is %s', self.prime_num_found[0])
    log.info('The largest is %s', self.prime_num_found[-1])

Let us have a look at FindNPrimeNumber.start(). First of all, we set the value of Processor.n_item to our target number of primes. We use Processor.item to store the current integer being tested, so we initialize it to start_from or, if start_from is even, to the first odd number following it. In FindNPrimeNumber.process() we need another while loop: we keep incrementing Processor.item by two units until it is a prime number. Then we add it to the storage list, increment it by two units again (remember that for while loop processors it is your responsibility to update the loop parameters) and get ready for the next loop iteration. As for the other processor, FindNPrimeNumber.finish() prints some statistics.
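Stripped of the processor machinery, the while loop strategy boils down to the plain Python sketch below. The names is_prime_sketch and find_n_primes are hypothetical, and for simplicity the candidate is incremented by one unit instead of skipping even numbers.

```python
import math


def is_prime_sketch(n: int) -> bool:
    """Trial division by odd numbers up to the square root of n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    return all(n % i for i in range(3, math.isqrt(n) + 1, 2))


def find_n_primes(start_from: int, how_many: int) -> list[int]:
    """Collect how_many primes, testing candidates from start_from onwards."""
    found: list[int] = []
    candidate = start_from
    # This test plays the role of while_condition(): keep looping until
    # enough primes have been collected.
    while len(found) < how_many:
        if is_prime_sketch(candidate):
            found.append(candidate)
        candidate += 1
    return found


primes = find_n_primes(50, 5)
print(primes)  # [53, 59, 61, 67, 71]
```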

Importing elements to the database

Note

This example is using concepts that have not yet been introduced, in particular the database. So in a first instance, you can simply skip it and come back later.

Importing elements into the database is a very common task that is required in all analytical projects. To accomplish this task, MAFw provides a dedicated base class (the Importer) that relies heavily on the FilenameParser to extract parameters from the filenames.

The ImporterExample is a concrete implementation of the base Importer that users can take as inspiration when developing their own importer subclass.

Before diving into the ImporterExample code analysis, we should understand the role and the functionality of two other helper classes: the FilenameElement and the FilenameParser.

Retrieving information from filenames

When setting up an experimental plan involving the acquisition of several data files, there are different approaches.

The descriptive approach, where the filename is used to store information about the measurement itself,

the metadata approach, where the same information is stored inside the file in a metadata section,

or the logbook approach, where the filename is just a unique identifier and the measurement information is stored in a logbook (another file, database, piece of paper…) using the same unique identifier.

The descriptive approach, despite sometimes being a bit messy because it may end up with very long filenames, is actually very practical. You do not need to be a hacker to include the metadata in the file itself, and you do not risk forgetting to add the parameters to the logbook.

The tricky part is to transfer this information to the database containing all your experiments, and you do not want to do this by hand, in order to avoid errors.

The best way is to use regular expressions, a subject in which Python excels, and MAFw assists you with two helper classes.

The first helper is the FilenameElement. This represents one single piece of information that is stored in the filename.

Let us assume that you have a file named sample_12_energy_10_repetition_2.dat. You can immediately spot that there are three different pieces of information stored in the filename: the sample name, the value of the energy in some unit that you should know, and the value of the repetition. Very likely there is also a repetition_1 file saved on disk.

In order to properly interpret the information stored in the filename, we need to define three FilenameElement instances, one for each of them.

If you look at the documentation of the FilenameElement, you will see that you need four arguments to build it:

its name, this is easy. Pick one, and use it as the name of the named group in the regular expression.

its regular expression, this is tricky. This is the pattern that Python uses to read and parse the actual element.

its type, this is the expected type for the element. It can be a string, an integer or a floating point number.

its default value, this is used to make the element optional. It means that if the element is not found, then the default value is returned. If no default value is provided and the element is not found then an error is raised.

Let us see how you could use FilenameElement class to parse the example filename.

filename = 'sample_12_energy_10_repetition_2.dat'

sample = FilenameElement('sample', r'[_]*(?P<sample>sample_\d+)[_]*', value_type=str)
energy = FilenameElement('energy', r'[_]*energy_(?P<energy>\d+\.*\d*)[_]*', value_type=float)
repetition = FilenameElement(
    'repetition', r'[_]*repetition_(?P<repetition>\d+)[_]*', value_type=int, default_value=1
)

sample.search(filename)
assert sample.value == 'sample_12'

energy.search(filename)
assert energy.value == 10

repetition.search(filename)
assert repetition.value == 2

The interesting thing is that you can swap the position of the elements in the filename, for example starting with the energy, and it will still work absolutely fine.

Just open a Python interpreter, import the FilenameElement class and give it a try yourself to familiarize yourself with regular expressions. Be careful when you write the regular expression pattern: since it usually contains a lot of ‘\’, it is useful to prefix the string with an r, in order to inform Python that what follows must be interpreted as a raw string.
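If you want to experiment before touching MAFw, the same patterns can be tried directly with the standard re module. The sketch below does not use FilenameElement at all; it only shows how the named groups behave, that the order of the elements in the filename does not matter, and what the r prefix changes.

```python
import re

# The same patterns used for the sample and energy elements above.
sample_pattern = re.compile(r'[_]*(?P<sample>sample_\d+)[_]*')
energy_pattern = re.compile(r'[_]*energy_(?P<energy>\d+\.*\d*)[_]*')

filenames = (
    'sample_12_energy_10_repetition_2.dat',
    'energy_10_sample_12_repetition_2.dat',  # energy and sample swapped
)

results = []
for filename in filenames:
    sample = sample_pattern.search(filename).group('sample')
    energy = float(energy_pattern.search(filename).group('energy'))
    results.append((sample, energy))

print(results)  # [('sample_12', 10.0), ('sample_12', 10.0)]

# A raw string keeps backslashes as they are typed.
print(r'\d+' == '\\d+')  # True
```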

If you want to gain confidence with regular expressions, make some tests and understand their power, we recommend to play around with one of the many online tools available on the web, like pythex.

The FilenameElement is already very helpful, but if you have several elements in the filename, the readability of your code will quickly degrade. To help you further, you can use the FilenameParser.

This is actually a combination of filename elements: when you interpret the filename by invoking interpret(), all of the filename elements are parsed, so that you can retrieve all parameters in a much easier way.

If you look at the FilenameParser documentation, you will see that you need a configuration file to build an instance of it. This configuration file contains the information needed to build all the filename elements.

Below you can see the configuration file and the Python test code.

Parser configuration

# FilenameParser configuration file
#
# General idea:
#
# The file contains the information required to build all the FilenameElement requested by the importer.
#
# Prepare a table for each element and in each table add the regexp, the type and optionally the default.
# Adding the default field, will make the element optional.
#
# Add the table name in the elements array. The order is irrelevant. The division in compulsory and optional elements
# is also irrelevant. It is provided here just for the sake of clarity.
#
# You can have as many element tables as you like, but only the one listed in the elements array will be used to
# configure the Importer.
#
elements = [
    # compulsory elements:
    'sample', 'energy',
    # optional elements:
    'repetition'
]


[sample]
regexp = '[_]*(?P<sample>sample_\d+)[_]*'
type='str'

[energy]
regexp = '[_]*energy_(?P<energy>\d+\.*\d*)[_]*'
type='float'

[repetition]
regexp = '[_]*repetition_(?P<repetition>\d+)[_]*'
type='int'
default = 1

Python test code

filename = 'energy_10.3_sample_12.dat'

parser = FilenameParser('example_conf.toml')
parser.interpret(filename)

assert parser.get_element_value('sample') == 'sample_12'
assert parser.get_element_value('energy') == 10.3
assert parser.get_element_value('repetition') == 1

The configuration file must contain a top level elements array with the name of all the filename elements that are included into the filename. For each value in elements, there must be a dedicated table with the same name containing the definition of the regular expression, the type and optionally the default value.

Important

In TOML configuration files, single quotation marks make a string a literal (raw) string, which is very important when passing expressions containing backslashes. If you prefer to use double quotation marks, you have to escape all backslashes.

The order of the elements in the elements array is irrelevant and also the fact we have divided them in compulsory and optional is just for the sake of clarity.

In the Python test code, you can see how the use of FilenameParser makes your code look much tidier and easier to read. In this second example, we have removed the optional repetition element from the filename, and you can see that the parser returns the default value of 1 for this element. We have also swapped the energy field with the sample name. Moreover, the energy field is now a floating point number with a decimal digit.
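To make the role of the default value more tangible, here is a simplified sketch of what a parser of this kind has to do. It is not the FilenameParser implementation: the ELEMENTS dictionary and the parse_filename function are hypothetical and only mirror the three tables of the configuration file above.

```python
import re

# Hypothetical in-memory equivalent of the three configuration tables.
ELEMENTS = {
    'sample': {'regexp': r'[_]*(?P<sample>sample_\d+)[_]*', 'type': str},
    'energy': {'regexp': r'[_]*energy_(?P<energy>\d+\.*\d*)[_]*', 'type': float},
    'repetition': {
        'regexp': r'[_]*repetition_(?P<repetition>\d+)[_]*',
        'type': int,
        'default': 1,
    },
}


def parse_filename(filename: str, elements: dict) -> dict:
    """Extract every element, falling back to its default when available."""
    values = {}
    for name, spec in elements.items():
        match = re.search(spec['regexp'], filename)
        if match:
            values[name] = spec['type'](match.group(name))
        elif 'default' in spec:
            values[name] = spec['default']  # optional element not found
        else:
            raise ValueError(f'Compulsory element {name!r} not found in {filename!r}')
    return values


parsed = parse_filename('energy_10.3_sample_12.dat', ELEMENTS)
print(parsed)  # {'sample': 'sample_12', 'energy': 10.3, 'repetition': 1}
```

A filename lacking a compulsory element, for example one without the sample name, makes this sketch raise a ValueError.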

The basic importer

With the power of these two helper classes, building a processor for parsing all our measurement filenames is a piece of cake. In the processor_library package, you can find a basic implementation of a generic Importer processor that you can use as a base class for your specific importer.

The idea behind this importer is that you are interested in the files inside an input_folder and possibly in all its subfolders. You can force the processor to look recursively in all subfolders by setting the processor parameter recursive to True. The last parameter of this processor is parser_configuration, the path to the FilenameParser configuration file.

This configuration file is used in the start() method of Importer (or any of its subclasses) to configure its FilenameParser, so that you do not have to worry about this step. In the process method of your subclass, the filename parser is ready to use straight away.

Let us have a look at the ImporterExample processor (available in the examples package) for a concrete implementation of an importer processor.

The ImporterExample processor

We will build a subclass of the Importer processor following the for_loop execution workflow.

In the start() method, we ensure that the target table exists in the database. The target database Model (InputElement in this example) should be defined in a separate database model module, so that other modules can import it as well.

class InputElement(MAFwBaseModel):
    """A model to store the input elements"""

    element_id = AutoField(primary_key=True, help_text='Primary key for the input element table')
    filename = FileNameField(unique=True, checksum_field='checksum', help_text='The filename of the element')
    checksum = FileChecksumField(help_text='The checksum of the element file')
    sample = TextField(help_text='The sample name')
    exposure = FloatField(help_text='The exposure time in hours')
    resolution = IntegerField(default=25, help_text='The readout resolution in µm')

def start(self) -> None:
    """
    The start method.

    The filename parser is ready to use because it has been already configured in the super method.
    We need to be sure that the input table exists, otherwise we create it from scratch.
    """
    super().start()
    self.database.create_tables([InputElement])

In get_items(), we create a list of all the files in the input_folder, in this case those with the .tif extension. We use the recursive flag to decide whether to include all subfolders as well.

The steering file may contain a GlobalFilter section (see the Filter section), and we use the new_only flag of the filter_register to remove from the input list all the files that have already been included in the database. It is also important to check that the table is up to date, because an entry may point to a file that has been modified in the meantime. For this purpose, verify_checksum() can be very useful. A more detailed explanation of this function will be presented in a subsequent section.

def get_items(self) -> Collection[Any]:
    r"""Retrieves the list of elements to be imported.

    The base folder is provided in the configuration file, along with the recursive flag and all the filter options.

    :return: The full file names of the items to be processed.
    :rtype: list[Path]
    """
    pattern = '**/*.tif' if self.recursive else '*.tif'
    input_folder_path = Path(self.input_folder)

    file_list = [file for file in input_folder_path.glob(pattern) if file.is_file()]

    # verify the checksum of the elements in the input table. if they are not up to date, then remove the row.
    verify_checksum(InputElement)

    if self.filter_register.new_only:
        # get the filenames that are already present in the input table
        existing_rows = InputElement.select(InputElement.filename).namedtuples()
        # create a set with the filenames
        existing_files = {row.filename for row in existing_rows}
        # filter out the file list from filenames that are already in the database.
        file_list = [file for file in file_list if file not in existing_files]

    return file_list
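The three steps of get_items() (globbing, discarding stale records, keeping only new files) can be tried out without a database. The following sketch uses only the standard library and is based on a few assumptions: the helper names file_digest and select_items are hypothetical, a plain dictionary plays the role of the input table, and SHA-256 is chosen only for illustration, since the algorithm used by verify_checksum() is not described here.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of the file content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def select_items(folder: Path, recursive: bool, known: dict[Path, str], new_only: bool) -> list[Path]:
    """List the .tif files in folder, optionally skipping the up-to-date known ones."""
    pattern = '**/*.tif' if recursive else '*.tif'
    files = sorted(f for f in folder.glob(pattern) if f.is_file())

    # Keep only the records whose file still exists with an unchanged content.
    up_to_date = {p for p, digest in known.items() if p.is_file() and file_digest(p) == digest}

    if new_only:
        files = [f for f in files if f not in up_to_date]
    return files


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'sub').mkdir()
    for name in ('a.tif', 'b.tif', 'notes.txt', 'sub/c.tif'):
        (root / name).write_text(name)

    # Pretend that a.tif and b.tif were imported earlier, then modify b.tif.
    known = {p: file_digest(p) for p in (root / 'a.tif', root / 'b.tif')}
    (root / 'b.tif').write_text('modified')

    flat = [f.name for f in select_items(root, False, known, new_only=False)]
    deep = [f.name for f in select_items(root, True, known, new_only=False)]
    fresh = [f.name for f in select_items(root, True, known, new_only=True)]

print(flat)   # ['a.tif', 'b.tif']
print(deep)   # ['a.tif', 'b.tif', 'c.tif']
print(fresh)  # ['b.tif', 'c.tif']
```

In the last call, a.tif is skipped because its record is still valid, b.tif is selected again because its content has changed since it was registered, and c.tif is selected because it was never imported.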

The ImporterExample follows an implementation approach that tries to maximise the efficiency of the database transaction. Instead of making one transaction for each element to be added to the database, all the elements are collected in a list and then transferred to the database with a single cumulative transaction at the end of the process. This approach is very efficient from the database point of view, but it can be more demanding in terms of memory. The best approach depends on the typical number of items added in each run and on the size of each element.

The implementation of process() is rather simple: as you can see from the source code, it retrieves the parameter values encoded in the filename via the FilenameParser. If you are wondering why we have assigned the filename to both the filename and the checksum fields, have a look at the section about custom fields.

def process(self):
    """
    The process method overload.

    This is where the whole list of files is scanned.

    The current item is a filename, so we can feed it directly to the FilenameParser interpret command to have it
    parsed. To maximise the efficiency of the database transaction, instead of inserting each file
    individually, we collect them all in a list and then insert all of them in the :meth:`~.finish` method.

    If the parsing fails, the element is skipped and an error message is logged.
    """
    try:
        new_element = {}
        self._filename_parser.interpret(self.item.name)
        new_element['sample'] = self._filename_parser.get_element_value('sample_name')
        new_element['exposure'] = self._filename_parser.get_element_value('exposure')
        new_element['resolution'] = self._filename_parser.get_element_value('resolution')
        new_element['filename'] = self.item
        new_element['checksum'] = self.item
        self._data_list.append(new_element)
    except ParsingError:
        log.critical('Problem parsing %s', self.item.name)
        self.looping_status = LoopingStatus.Skip

The finish() method is where the real database transaction occurs. All the elements have been collected in a list, so we can use an insert_many statement to transfer them all to the corresponding model in the database. Since we have declared the filename field as unique (this was our implementation decision, but the user is free to relax this requirement), we have added an on_conflict clause to deal with the case in which the user is updating an entry with the same filename.

Since the super method prints the execution statistics, we leave its call at the end of the implementation.

def finish(self) -> None:
    """
    The finish method overload.

    Here is where we do the database insert with an on_conflict_replace to cope with the unique constraint.
    """
    # we are ready to insert the lines in the database
    InputElement.insert_many(self._data_list).on_conflict_replace(replace=True).execute()

    # the super is printing the statistics, so we call it after the implementation
    super().finish()
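The effect of a batched insert with a replace-on-conflict clause can be reproduced with the sqlite3 module of the standard library. The sketch below assumes an SQLite backend, where a replace conflict strategy corresponds to an INSERT OR REPLACE statement. The table layout is a simplified, hypothetical version of InputElement, and the code talks to the database directly rather than through the model classes.

```python
import sqlite3

connection = sqlite3.connect(':memory:')
connection.execute(
    'CREATE TABLE input_element ('
    'element_id INTEGER PRIMARY KEY, filename TEXT UNIQUE, sample TEXT, exposure REAL)'
)

# A row imported during a previous run.
connection.execute(
    'INSERT INTO input_element (filename, sample, exposure) VALUES (?, ?, ?)',
    ('sample_1.tif', 'sample_1', 1.0),
)
connection.commit()

# Rows collected during the current run; the first one clashes on the filename.
data_list = [
    {'filename': 'sample_1.tif', 'sample': 'sample_1', 'exposure': 2.5},
    {'filename': 'sample_2.tif', 'sample': 'sample_2', 'exposure': 4.0},
]

# One transaction for the whole batch: committed on success, rolled back on error.
with connection:
    connection.executemany(
        'INSERT OR REPLACE INTO input_element (filename, sample, exposure) '
        'VALUES (:filename, :sample, :exposure)',
        data_list,
    )

rows = connection.execute(
    'SELECT filename, exposure FROM input_element ORDER BY filename'
).fetchall()
connection.close()

print(rows)  # [('sample_1.tif', 2.5), ('sample_2.tif', 4.0)]
```

The unique constraint on filename keeps a single row per file, and the clashing row ends up with the newly collected values. Note that in SQLite a replace deletes the conflicting row and inserts a fresh one, so it is a replacement and not an in-place update.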