Home Algorithms Strings Extract specified values from strings 29 | 05 | 2024
Feed Display
Latest News
Popular
 Extract specified values from strings
Algorithms - Strings
Written by Jan Schulz
Tuesday, 12 February 2008 00:15

# Extract the nth number from a string

## Objective

Several applications require the extraction of numbers with a certain index from strings of arbitrary length. Typical examples are data strings acquired via RS232 or USB interfaces from external devices. A data set of three lines may look like this:

• `[1]: 0.2321 –1321 00101010 TRUE 0.212`
• `[2]: 0.5 –1324 00101010 TRUE 0.213`
• `[3]: 0.75 1325 00101010 FALSE 0.25`

When the position of the requested value can not be achieved by defined index positions for the first and last character the string needs to be parsed. The easiest way is to start at the first symbol of each line, search and count the delimiters and cut out the requested item. However, a number of scenarios are imaginable when this is insufficient. An example would be unknown, changing or heterogeneous delimiters due to data migration from different sources. Alternatively, if one is interested in the nth value in the string and not in the value behind the nth delimiter or wants to identify whether a string contains at least n numbers. Additionally, one might only be interested in positive values or integers. In this case decimal separators or signs can also be delimiters. Here, an algorithm is presented that parses a string and extracts the nth value, if present. The interpretation of minus signs and decimal separators can be adjusted.

## Approach

To solve this task implementation of an adopted state machine is advised. This machine needs to successively evaluate symbols of a passed string in the normal reading direction. A grammar controls different processing steps, e.g. when a symbol belongs to a number or needs to be interpreted as delimiter. Also, it determines how to identify and count the position of found values and define how negative values or those with decimal separators are treated. These rules determine how to modify the internal states until a stop condition is fulfilled.

To set up a state machine for the identification and extraction of a given number from a string requires a universal set of instructions covering all symbol combinations. Therefore, the different symbols a number consists of need to be identified. Four different types of characters can be distinguished:

1. Ciphers
2. Decimal separators
3. Minus signs
4. Plus signs

Any other symbol delimits numbers. The presence of the the plus sign is not mandatory for positive values. Thus, the list is reduced to three symbols and instruction classes to distinguish while parsing.

When processing decimal separators it is necessary to recognize regional preferences. While countries like Australia, China, Saudi-Arabia, United Kingdom or United States use the decimal point, others like Germany, the Scandinavian countries, Spain or wide parts of South America and Russia make use of the comma. Comma and point can only be used interchangeably as decimal separators when the respective other symbol is not used to separate groups of three digits (thousands) in larger numbers. Consequently, a string containing numbers with both separators can not be extracted correctly.

## Algorithm

Four parameters are necessary for initialisation of the state machine. The first is the string the state machine is to extract a number from. The second is the command which number in the line shall be extracted. Finally, it needs to know how to handle negative values (minus signs) and decimal separators. These variables are defined here as aString, ValueNo, IncludeMinus and IncludeDecSep.

As prerequisite condition aString needs to have at least one symbol, otherwise the algorithm needs not to start. If symbols are present the algorithm starts at the first position of aString, successively evaluates symbols and attaches those being part of a grammatically correct number to a temporary string.

Further the state variables are required, defining processing rules. These are handled and modified internally during sequential processing. First is the result string (Result) that contains the requested value after a successful search. Second is an integer variable that points at the actually processed symbol in the string (StringPos). An integer counter (ActualValue) denotes the actual number of values found to the left of the position of StringPos. Any value extracted form aString is stored in a second variable (SubString). When the machine reaches a final state and the content of SubString is the requested number this content is returned by the function in Result. A Boolean state variable (isExtracted), set when the requested number was found, is used to initiate search termination. A second Boolean variable (wasDecSepFound) displays whether the value in SubString already bears a decimal separator.

A termination state is reached when isExtracted is set or StringPos reaches the last symbol of aString. After termination the content of SubString can be returned as result, if ActualNumber = ValueNo. In this case the nth number was found. It is unsufficient to test for isExtracted as it would be set after StringPos is greater or equal than the length of aString. Thus, the function would fail when the requested number stands at the end of aString. With these outer definitions the algorithmic backbone of the state machine can be planned (Figure 1).

 Figure 1: Algorithmic backbone of a state machine for extraction of numbers from a string. The treatments of symbols that occur in numbers are performed in the sub-routines 1 to 3. Any other symbol does not belong to a number and is a delimiter a priori. The asterisk shows the position from where the states are displayed in the Example section.

## Ciphers

Numbers must contain digits and can not just consist of minus sign or decimal separator alone. These are optional symbols in numbers. For this the counter ActualNumber is only increased whenever the found digit is the first digit in SubString. Four cases can be distinguished for the beginning of a new number. Any digit indicates the beginning of a new number when:

1. SubString is empty, or
2. SubString contains a minus, or
3. SubString contains a decimal separator, or
4. SubString contains a minus followed by a decimal separator.
 Figure 2: Algorithmic treatment when a cipher was found.

At this point it is not necessary to evaluate IncludeMinus and IncludeDecSep, because these symbols will never be included in SubString when the respective flag is not set. As ciphers can stand at every position and consecutive digits belong to the same number no further limitations disallow appending.

## Minus sign

By definition minus signs can only stand at the first position of a number. Thus, any occurrence signalises that maybe so far stored digits in SubString finally represent a number. If ActualValue is equal to ValueNo then SubString contains a copy of the requested number (and) isExtracted will be set to state termination of search.

 Figure 3: Algorithmic treatment when a minus sign was found.

Whenever SubString does not contain the requested number its content can be discarded.

A decision has to be made whether minus signs are delimiters (IncludeMinus = FALSE) or are included by the grammar (IncludeMinus = TRUE). In the first case SubString remains empty at this point, while in the latter the minus sign is the first symbol of any potentially following number. In any case, with the beginning of a new number the state variable wasDecSepFound also needs to be reset.

## Decimal separator

Decimal separators are usually flanked by digits on both sides. Nevertheless, notation allows writing numbers without ciphers on one side of its decimal separator but never both. Numbers like:

• -213,
• 213,
• ,213
• -,213

are true values, although successive or leading zeroes were omitted in this notation. For this reason an increment of ActualNumber was linked to the finding of ciphers (see above in subsection Cipher).

 Figure 4: Algorithmic treatment when a decimal separator was found

First differentiation after finding a decimal separator in aString is whether these symbols represent delimiters or are evaluated as belonging to a number. When point and comma are delimiters they signalise that maybe existing digits in SubString are finalised. If then ActualValue = ValueNo a copy of the requested number is found in SubString like in the above handling of minus signs. Then isExtracted will be set to signalise termination of search. Otherwise the content of SubString is discarded and thus prepared for the next try.

When decimal separators are no delimiters it is important to control whether one of these is already present in SubString. Only one decimal separator per number is valid. When the state variable wasDecSepFound is not set the decimal separator is attached to SubString. Using the variable DecimalSeparator of the operating system ensures proper conversion from string to float later.

When the decimal separator is no delimiter and one of this symbols is already included in SubString again the option arises, whether ActualValue = ValueNo to terminate search. Otherwise Substring will be deleted, a decimal separator is used as first symbol therein and wasDecSepFound set.

## Source code

`Function GetNthValueInString (aString: String; ValueNo: Integer; IncludeMinus: Boolean; IncludeDecSep: Boolean): String;// This function extracts the n'th value given in the variable ValueNo out// of a string. With the boolean variables one can specify whether negative// values or decimal separators are evaluated; first reappearance of a decimal// separator or minus sign within a number reinitialises this char as beginning// of the next number// (c) Dr. Jan Schulz, 2007; www.code10.infoVar StringPos      : Integer;    aChar          : Char;    ActualNumber   : Integer;    SubString      : String;    isExtracted    : Boolean;    wasDecSepFound : Boolean;Begin  // initialise variables  Result         := '';    // return value when requested value is not found  ActualNumber   := 0;     // begin with no value found  StringPos      := 0;     // start at position left of first char  SubString      := '';    // start with an empty substring variable  isExtracted    := False; // requested value is not found so far  wasDecSepFound := False; // no DecimalSeparator is found  // are characters in the string?  If Length (aString) > 0 THen  Begin    Repeat      // move to next character position in the string      Inc (StringPos);      // and get it it      aChar := aString[StringPos];      Case aChar of        '0'..'9' : Begin                     // was a new number found?                     If (SubString = '') Or                        (SubString = '-') Or                        (SubString = DecimalSeparator) Or                        (SubString = ('-' + DecimalSeparator)) THen Inc (ActualNumber);                     // attach cipher to substring                     SubString := SubString + aChar;                   end;        '-'      : Begin                     // is the already found number requested one?                     If ActualNumber = ValueNo THen                     Begin                       isExtracted := True;                     end                     Else                     // otherwise evaluate minus sign                     Begin                       //do we care for minus signs?                       If IncludeMinus THen                       Begin                         // minus sign is first symbol for SubString                         SubString      := '-';                         wasDecSepFound := False;                       end                       Else                       // otherwise beginning of a new number                       Begin                         // reset states and SubString                         SubString      := '';                         wasDecSepFound := False;                       end;                     end;                   end;        '.', ',' : Begin                     // do we care for decimal separators?                     If IncludeDecSep THen                     Begin                       // did we already find a decimal separator?                       If wasDecSepFound THen                       Begin                         // terminate when the requested number was found                         If ActualNumber = ValueNo THen                         Begin                           isExtracted := True;                         end                         Else                         // otherwise interpret char as first of a new number                         Begin                           // attach decimal separator and do not reset the wasDecSepFound                           wasDecSepFound := True;                           SubString      := DecimalSeparator;                         end;                       end                       Else                       // first decimal separator in number is attached and presence notified                       Begin                         wasDecSepFound := True;                         SubString      := SubString + DecimalSeparator;                       end;                     end                     Else                     // decimal separators are delimiters                     Begin                       // is the already found number the requested one?                       If ActualNumber = ValueNo THen                       Begin                         isExtracted := True;                       end                       Else                       // otherwise reset substring for next try                       Begin                         SubString      := '';                         wasDecSepFound := False;                       end;                     end;                   end;        // all other chars are delimiters        Else          // is the already found number the requested one?          If ActualNumber = ValueNo THen          Begin            isExtracted := True;          end          Else          //otherwise reset substring for next try          Begin            SubString      := '';            wasDecSepFound := False;          end;        end;    Until isExtracted Or (StringPos >= Length (aString));  end;  If ActualNumber = ValueNo THen Result := SubString;end; `

## Examples

Observing the variables while parsing the second line of the fictive example above shown in the Objective section above demonstrates internal functioning of the state machine. The state of the variables is displayed after the actual symbol is processed. This point is marked with an asterisk in Figure 1.

Example 1: A call of the function with:

• aString = ’0.5 –1324 00101010 TRUE 0.213’
• ValueNo = 1
• IncludeMinus = TRUE
• IncludeDecSep = TRUE

results stepwise in:

 aChar StringPos ActualNumber isExtracted wasDecSepFound SubString 0 1 1 FALSE FALSE 0 . 2 1 FALSE TRUE 0. 5 3 1 FALSE TRUE 0.5 4 1 TRUE TRUE 0.5

After the fourth step a symbol that does not belong to a number causes evaluation of the boolean expression (ActualNumber = ValueNo). As the outcome is TRUE, isExtracted is set the content of SubString is not discarded and causes the state machine to halt. Thereafter, the same evaluation returns content of substring as result.

Example 2: Calling the function with the parameters:

• aString = ’0.5 –1324 00101010 TRUE 0.213’
• ValueNo = 4
• IncludeMinus = True
• IncludeDecSep = True

successively results in:

 aChar StringPos ActualNumber isExtracted wasDecSepFound SubString 0 1 1 FALSE False 0 . 2 1 FALSE TRUE 0. 5 3 1 FALSE TRUE 0.5 4 1 FALSE False - 5 1 FALSE False - 1 6 2 FALSE False -1 3 7 2 FALSE False -13 2 8 2 FALSE False -132 4 9 2 FALSE False -1324 10 2 FALSE False 0 11 3 FALSE False 0 0 12 3 FALSE False 00 1 13 3 FALSE False 001 0 14 3 FALSE False 0010 1 15 3 FALSE False 00101 0 16 3 FALSE False 001010 1 17 3 FALSE False 0010101 0 18 3 FALSE False 00101010 19 3 FALSE False T 20 3 FALSE False R 21 3 FALSE False U 22 3 FALSE False E 23 3 FALSE False 24 3 FALSE False 0 25 4 FALSE False 0 . 26 4 FALSE True 0. 2 27 4 FALSE True 0.2 1 28 4 FALSE True 0.21 3 29 4 FALSE True 0.213

When StringPos is as high as the number of symbols in aString the algorithm terminates the loop. As already mentioned isExtracted is not set to TRUE at this point. Consequently, the boolean expression (ActualNumber = ValueNo) decides whether the content of SubString is returned as result.

## Homework

The described algorithm works correct under the terms described above. However, the critical reader may be interested in analysing the above explanations in more detail. He will find two sites where state variables are modified although not necessary. Additionally, the code may show some redundancies.

` `
Last Updated on Friday, 18 March 2011 18:26