mafw.processor_library.importer
Provides a basic element importer.
The first step in setting up the analytical framework of a data analysis procedure is to add new elements to the input set. These elements can encompass a wide range of data, including results from experiments or simulations, as well as information gathered through web scraping or other data sources.
Regardless of where the data come from, one common task is to add them to your collection inside the DB, so that the following analytical steps know where the data are and what they are.
This module provides a generic processor that the user can subclass and customize to import input files. Thanks to configurable filename parsing, further information can be extracted from the filename itself and used to populate additional columns in the dedicated database table.
Classes
FilenameElement | Helper class for the definition of a filename element.
FilenameParser | Helper class to interpret all elements in a filename.
Importer | Importer is the base class for importing elements into the database structure.
- class mafw.processor_library.importer.FilenameElement(name: str, regex: str | ~re.Pattern[str], value_type: type = <class 'str'>, default_value: str | int | float | None = None)[source]
Bases: object
Helper class for the definition of a filename element.
While importing an element to the DB, several parameters can be retrieved directly from the filename. The role of this helper class is to provide an easy way to define patterns in the filename representing a specific piece of information that has to be transferred to the DB.
The element is characterized by a name, a regular expression, the expected Python type for the parsed value and an optional default value. The regular expression should contain a named group in the form ?P<name>, where name matches the FilenameElement name.
To make a filename element optional, it is enough to provide a default value different from None. In this case, if the parsing fails, the default value is returned.
Constructor parameters:
- Parameters:
name (str) – The name of the filename element
regex (str | re.Pattern[str]) – The regular expression associated to this filename element. It must contain a named group in the form ?P<name>.
value_type (type, Optional) – The type the output value should be converted into. It defaults to str.
default_value (Any, Optional) – The default value to assign to the filename element if the pattern is not found in the filename. It defaults to None
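The named-group convention and the default-value fallback described above can be illustrated with the standard re module alone. This is a minimal sketch, not mafw's implementation: the helper extract and the sample filename are hypothetical.

```python
import re


def extract(filename, name, regex, value_type=str, default=None):
    """Return the value captured by the named group, converted to value_type.

    Hypothetical helper mimicking the documented behaviour: when the
    pattern is not found, the default is returned (None for a compulsory
    element).
    """
    match = re.search(regex, filename)
    if match is None:
        return default
    return value_type(match.group(name))


filename = "sample_12_exp30.5s.tif"  # made-up filename for illustration

# The group name matches the element name, as the documentation requires.
exposure = extract(filename, "exposure", r"exp(?P<exposure>\d+\.\d+)s", float)

# Optional element: the pattern is absent, so the default is used instead.
resolution = extract(filename, "resolution", r"(?P<resolution>\d+)um", int, default=25)

# Compulsory element (no default): a failed search yields None.
operator = extract(filename, "operator", r"op-(?P<operator>[a-z]+)")
```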
- classmethod _get_value_type(type_as_string: str) type[source]
Returns the value type.
This method is used by the class method constructor to check if the user provided type in the form of a string is a valid one.
If so, then the corresponding python type is returned, otherwise a ValueError exception is raised.
- Parameters:
type_as_string (str) – The type of the value as a string.
- Returns:
The corresponding python type.
- Return type:
type
- Raises:
ValueError – if type_as_string is not one of the acceptable types for the value.
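A string-to-type lookup of this kind can be sketched as follows. The table mirrors the documented type_lut attribute, while the function name and the error message are assumptions made for illustration only.

```python
# Mirrors the documented type_lut: type names as strings mapped to Python types.
TYPE_LUT = {"str": str, "int": int, "float": float}


def get_value_type(type_as_string):
    """Return the Python type for a type name, or raise ValueError.

    Hypothetical stand-alone version of the check described above.
    """
    try:
        return TYPE_LUT[type_as_string]
    except KeyError:
        allowed = ", ".join(sorted(TYPE_LUT))
        raise ValueError(
            f"{type_as_string!r} is not a valid type; expected one of: {allowed}"
        ) from None
```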
- classmethod from_dict(name: str, info_dict: dict[str, str | int | float]) FilenameElement[source]
Generates a FilenameElement starting from external information stored in a dictionary.
- info_dict should contain the following three keys:
regexp: the Regular expression for the element search.
type: a string with the python type name (int, float, str) for the element conversion.
default (optional): a default value.
- Parameters:
name (str) – The name of the element.
info_dict (dict) – The dictionary with the required parameters for the class constructor.
- Returns:
An instance of FilenameElement.
- Return type:
FilenameElement
- _validate_default_type() None[source]
Checks that the default has a type matching the value type. The check is actually performed if and only if a default value is provided. If None, then the validation is skipped.
- Raises:
TypeError – if the default value type does not match the declared value type.
- _validate_regexp() None[source]
Checks if the regular expression contains a named group named after the element itself.
- Raises:
ValueError – if the regular expression is not valid.
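One way to perform such a check with the standard library is to compile the expression and look at Pattern.groupindex, which maps every named group to its group number. The sketch below is an assumption about how the check could be done, not a copy of mafw's code; has_named_group is a hypothetical name.

```python
import re


def has_named_group(name, regex):
    """Return True if regex defines a named group called name.

    regex may be a string or an already compiled pattern; re.compile
    accepts both.
    """
    pattern = re.compile(regex)
    return name in pattern.groupindex


valid = has_named_group("run", r"run(?P<run>\d+)")
unnamed = has_named_group("run", r"run(\d+)")         # group exists but is unnamed
misnamed = has_named_group("run", r"run(?P<id>\d+)")  # group has a different name
```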
- reset() None[source]
Resets the value to the default value.
Remember that the default value is None for compulsory elements.
- search(string: str | Path) None[source]
Searches the string for the regular expression.
If the pattern is found in the string, then the matched value is transferred to the FilenameElement value.
Note
This method does not return the matched value. It only searches the input string for the registered pattern. If the pattern is found, the user can retrieve the matched value through the value property. If the pattern is not found, value will be either None, for a compulsory element, or the default value for an optional one.
- Parameters:
string (str | Path) – The string to be parsed. In most cases this is a filename, which is why the method also accepts a Path.
- property is_found: bool
Returns whether the filename element is found
- property is_optional: bool
Returns whether the element is optional
- property name: str
Returns the class name
- property pattern: str | bytes
Returns the regular expression pattern
- type_lut: dict[str, type[str] | type[int] | type[float]] = {'float': <class 'float'>, 'int': <class 'int'>, 'str': <class 'str'>}
A lookup table for converting type definition as string into python types
- property value: str | int | float | None
Returns the class value
- class mafw.processor_library.importer.FilenameParser(configuration_file: str | Path, filename: str | Path | None = None)[source]
Bases: object
Helper class to interpret all elements in a filename.
Inside a filename, there might be many elements containing information about the item that must be stored in the DB. This class will parse the filename, and after a successful identification of them all, it will make them available for the importer class to fill in the fields in the database.
The FilenameParser needs to be configured to be able to recognise each element in the filename. Such configuration is saved in a TOML file. An example of such a configuration is provided here.
Each element must start with its name and provide a valid regular expression and a Python type (as a string). If an element is optional, then a default value must be provided as well.
After the configuration, the filename can be interpreted by invoking the interpret() method. This performs the actual parsing of the filename. If an error occurs during the parsing process, meaning that a compulsory element is not found, a ParsingError exception is raised, so remember to protect the interpretation with a try/except block.
The value of each filename element is available upon request: simply invoke get_element_value() with the element name.
Constructor parameters:
- Parameters:
filename (str | Path) – The filename to be interpreted.
configuration_file (str | Path) – The configuration file for the interpreter.
- Raises:
ParserConfigurationError – If the configuration file is invalid.
- _parser_configuration() None[source]
Loads the parser configuration, generates the required FilenameElement instances and adds them to the element dictionary.
The configuration is stored in a TOML file.
This private method is automatically invoked by the class constructor.
- Raises:
ParserConfigurationError – if the provided configuration file is invalid.
- get_element(element_name: str) FilenameElement | None[source]
Gets the FilenameElement named element_name
- get_element_value(element_name: str) str | int | float | None[source]
Gets the value of the FilenameElement named element_name.
It is equivalent to calling
self.get_element('element_name').value
- interpret(filename: str | Path | None = None) None[source]
Performs the interpretation of the filename.
The filename can be provided either as a constructor argument or as an argument here. If both are given, the one passed here takes precedence.
- Raises:
ParsingError – if a compulsory element is not found in the filename
MissingAttribute – if no filename has been specified.
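The precedence rule and the missing-filename error can be pictured with a small stand-in class. MiniParser and MissingFilename are hypothetical names and do not reproduce mafw's classes. Unlike the documented interpret(), which returns None, this sketch returns the chosen file name so that the choice is visible.

```python
from pathlib import Path


class MissingFilename(Exception):
    """Stand-in for the MissingAttribute error named above."""


class MiniParser:
    """Hypothetical stand-in showing only how the filename is selected."""

    def __init__(self, filename=None):
        self._filename = filename

    def interpret(self, filename=None):
        # The argument passed here wins over the one given to the constructor.
        chosen = filename if filename is not None else self._filename
        if chosen is None:
            raise MissingFilename("no filename has been specified")
        return Path(chosen).name


parser = MiniParser("from_constructor.tif")
used_constructor = parser.interpret()
used_local = parser.interpret(Path("data") / "from_call.tif")

# Protect the call with try/except, as the class documentation recommends.
try:
    MiniParser().interpret()
    failed = False
except MissingFilename:
    failed = True
```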
- _configuration_file
The configuration file for the interpreter.
- _element_dict: dict[str, FilenameElement]
A dictionary with all the FilenameElement instances
- _filename
The filename for this interpreter. If None, it should be specified before interpretation.
- property elements: dict[str, FilenameElement]
Returns the filename element dictionary
- class mafw.processor_library.importer.Importer(*args: Any, **kwargs: Any)[source]
Bases: Processor
Importer is the base class for importing elements into the database structure.
It provides an easy skeleton to be subclassed by a more specific importer related to a certain project.
It can be customised with three processor parameters:
The parser_configuration: the path to the configuration file for the FilenameParser.
The input_folder: the path where the input files to be imported are.
The recursive flag: to specify if all subfolders should also be scanned.
For a concrete implementation, have a look at the ImporterExample from the example library.
Processor parameters
input_folder: The input folder from where the images have to be imported. (default: ‘/tmp/mafw-docs-d16l86xo/v1.4.0’)
parser_configuration: The path to the TOML file with the filename parser configuration (default: ‘parser_configuration.toml’)
recursive: Extend the search to sub-folders (default: True)
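The effect of the recursive flag on the file search can be sketched with pathlib. How Importer actually scans input_folder is not shown in this reference, so the helper list_input_files below is only one possible approach.

```python
import tempfile
from pathlib import Path


def list_input_files(input_folder, recursive=True):
    """Return the sorted files in input_folder, optionally including sub-folders."""
    folder = Path(input_folder)
    candidates = folder.rglob("*") if recursive else folder.glob("*")
    return sorted(p for p in candidates if p.is_file())


# Build a throwaway folder tree to show the difference between the two modes.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.tif").touch()
    (root / "sub" / "b.tif").touch()

    flat = [p.name for p in list_input_files(root, recursive=False)]
    deep = [p.name for p in list_input_files(root, recursive=True)]
```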
Constructor parameters
- Parameters:
name (str, Optional) – The name of the processor. If None is provided, the class name is used instead. Defaults to None.
description (str, Optional) – A short description of the processor task. Defaults to the processor name.
config (dict, Optional) – A configuration dictionary for this processor. Defaults to None.
looper (LoopType, Optional) – Enumerator to define the looping type. Defaults to LoopType.ForLoop
user_interface (UserInterfaceBase, Optional) – A user interface instance to be used by the processor to interact with the user.
timer (Timer, Optional) – A timer object to measure process duration.
timer_params (dict, Optional) – Parameters for the timer object.
database (Database, Optional) – A database instance. Defaults to None.
database_conf (dict, Optional) – Configuration for the database. Defaults to None.
remove_orphan_files (bool, Optional) – Boolean flag to remove files on disc without a reference to the database. See Standard tables and _remove_orphan_files(). Defaults to True.
kwargs – Keyword arguments that can be used to set processor parameters.
- format_progress_message() None[source]
Customizes the progress message with information about the current item.
The user can overload this method in order to modify the message being displayed during the process loop with information about the current item.
The user can access the current value, its position in the looping cycle and the total number of items using Processor.item, Processor.i_item and Processor.n_item.
- start() None[source]
The start method.
The filename parser is created using the provided configuration file.
- Raises:
ParserConfigurationError – If the configuration file is not valid.
- _filename_parser: FilenameParser
The filename parser instance