Extract specified values from strings Print
Algorithms - Strings
Written by Jan Schulz   
Tuesday, 12 February 2008 00:15

Extract the nth number from a string

 

Objective

Several applications require the extraction of numbers with a certain index from strings of arbitrary length. Typical examples are data strings acquired via RS232 or USB interfaces from external devices. A data set of three lines may look like this:

  • [1]: 0.2321 –1321 00101010 TRUE 0.212
  • [2]: 0.5 –1324 00101010 TRUE 0.213
  • [3]: 0.75 1325 00101010 FALSE 0.25

When the position of the requested value can not be achieved by defined index positions for the first and last character the string needs to be parsed. The easiest way is to start at the first symbol of each line, search and count the delimiters and cut out the requested item. However, a number of scenarios are imaginable when this is insufficient. An example would be unknown, changing or heterogeneous delimiters due to data migration from different sources. Alternatively, if one is interested in the nth value in the string and not in the value behind the nth delimiter or wants to identify whether a string contains at least n numbers. Additionally, one might only be interested in positive values or integers. In this case decimal separators or signs can also be delimiters. Here, an algorithm is presented that parses a string and extracts the nth value, if present. The interpretation of minus signs and decimal separators can be adjusted.

 

Approach

To solve this task implementation of an adopted state machine is advised. This machine needs to successively evaluate symbols of a passed string in the normal reading direction. A grammar controls different processing steps, e.g. when a symbol belongs to a number or needs to be interpreted as delimiter. Also, it determines how to identify and count the position of found values and define how negative values or those with decimal separators are treated. These rules determine how to modify the internal states until a stop condition is fulfilled.

To set up a state machine for the identification and extraction of a given number from a string requires a universal set of instructions covering all symbol combinations. Therefore, the different symbols a number consists of need to be identified. Four different types of characters can be distinguished:

  1. Ciphers
  2. Decimal separators
  3. Minus signs
  4. Plus signs

Any other symbol delimits numbers. The presence of the the plus sign is not mandatory for positive values. Thus, the list is reduced to three symbols and instruction classes to distinguish while parsing.

When processing decimal separators it is necessary to recognize regional preferences. While countries like Australia, China, Saudi-Arabia, United Kingdom or United States use the decimal point, others like Germany, the Scandinavian countries, Spain or wide parts of South America and Russia make use of the comma. Comma and point can only be used interchangeably as decimal separators when the respective other symbol is not used to separate groups of three digits (thousands) in larger numbers. Consequently, a string containing numbers with both separators can not be extracted correctly.

 

Algorithm

Four parameters are necessary for initialisation of the state machine. The first is the string the state machine is to extract a number from. The second is the command which number in the line shall be extracted. Finally, it needs to know how to handle negative values (minus signs) and decimal separators. These variables are defined here as aString, ValueNo, IncludeMinus and IncludeDecSep.

As prerequisite condition aString needs to have at least one symbol, otherwise the algorithm needs not to start. If symbols are present the algorithm starts at the first position of aString, successively evaluates symbols and attaches those being part of a grammatically correct number to a temporary string.

Further the state variables are required, defining processing rules. These are handled and modified internally during sequential processing. First is the result string (Result) that contains the requested value after a successful search. Second is an integer variable that points at the actually processed symbol in the string (StringPos). An integer counter (ActualValue) denotes the actual number of values found to the left of the position of StringPos. Any value extracted form aString is stored in a second variable (SubString). When the machine reaches a final state and the content of SubString is the requested number this content is returned by the function in Result. A Boolean state variable (isExtracted), set when the requested number was found, is used to initiate search termination. A second Boolean variable (wasDecSepFound) displays whether the value in SubString already bears a decimal separator.

A termination state is reached when isExtracted is set or StringPos reaches the last symbol of aString. After termination the content of SubString can be returned as result, if ActualNumber = ValueNo. In this case the nth number was found. It is unsufficient to test for isExtracted as it would be set after StringPos is greater or equal than the length of aString. Thus, the function would fail when the requested number stands at the end of aString. With these outer definitions the algorithmic backbone of the state machine can be planned (Figure 1).

 

 

Figure 1: Algorithmic backbone of a state machine for extraction of numbers from a string. The treatments of symbols that occur in numbers are performed in the sub-routines 1 to 3. Any other symbol does not belong to a number and is a delimiter a priori. The asterisk shows the position from where the states are displayed in the Example section.

 

Ciphers

Numbers must contain digits and can not just consist of minus sign or decimal separator alone. These are optional symbols in numbers. For this the counter ActualNumber is only increased whenever the found digit is the first digit in SubString. Four cases can be distinguished for the beginning of a new number. Any digit indicates the beginning of a new number when:

  1. SubString is empty, or
  2. SubString contains a minus, or
  3. SubString contains a decimal separator, or
  4. SubString contains a minus followed by a decimal separator.

Figure 2: Algorithmic treatment when a cipher was found.

At this point it is not necessary to evaluate IncludeMinus and IncludeDecSep, because these symbols will never be included in SubString when the respective flag is not set. As ciphers can stand at every position and consecutive digits belong to the same number no further limitations disallow appending.

 

Minus sign

By definition minus signs can only stand at the first position of a number. Thus, any occurrence signalises that maybe so far stored digits in SubString finally represent a number. If ActualValue is equal to ValueNo then SubString contains a copy of the requested number (and) isExtracted will be set to state termination of search.

Figure 3: Algorithmic treatment when a minus sign was found.

Whenever SubString does not contain the requested number its content can be discarded.

A decision has to be made whether minus signs are delimiters (IncludeMinus = FALSE) or are included by the grammar (IncludeMinus = TRUE). In the first case SubString remains empty at this point, while in the latter the minus sign is the first symbol of any potentially following number. In any case, with the beginning of a new number the state variable wasDecSepFound also needs to be reset.

 

Decimal separator

Decimal separators are usually flanked by digits on both sides. Nevertheless, notation allows writing numbers without ciphers on one side of its decimal separator but never both. Numbers like:

  • -213,
  • 213,
  • ,213
  • -,213

are true values, although successive or leading zeroes were omitted in this notation. For this reason an increment of ActualNumber was linked to the finding of ciphers (see above in subsection Cipher).

Figure 4: Algorithmic treatment when a decimal separator was found

First differentiation after finding a decimal separator in aString is whether these symbols represent delimiters or are evaluated as belonging to a number. When point and comma are delimiters they signalise that maybe existing digits in SubString are finalised. If then ActualValue = ValueNo a copy of the requested number is found in SubString like in the above handling of minus signs. Then isExtracted will be set to signalise termination of search. Otherwise the content of SubString is discarded and thus prepared for the next try.

When decimal separators are no delimiters it is important to control whether one of these is already present in SubString. Only one decimal separator per number is valid. When the state variable wasDecSepFound is not set the decimal separator is attached to SubString. Using the variable DecimalSeparator of the operating system ensures proper conversion from string to float later.

When the decimal separator is no delimiter and one of this symbols is already included in SubString again the option arises, whether ActualValue = ValueNo to terminate search. Otherwise Substring will be deleted, a decimal separator is used as first symbol therein and wasDecSepFound set.

 

Source code

Function GetNthValueInString (aString: String; ValueNo: Integer; IncludeMinus: Boolean; IncludeDecSep: Boolean): String;
// This function extracts the n'th value given in the variable ValueNo out
// of a string. With the boolean variables one can specify whether negative
// values or decimal separators are evaluated; first reappearance of a decimal
// separator or minus sign within a number reinitialises this char as beginning
// of the next number
// (c) Dr. Jan Schulz, 2007; www.code10.info
Var StringPos : Integer;
aChar : Char;
ActualNumber : Integer;
SubString : String;
isExtracted : Boolean;
wasDecSepFound : Boolean;
Begin
// initialise variables
Result := ''; // return value when requested value is not found
ActualNumber := 0; // begin with no value found
StringPos := 0; // start at position left of first char
SubString := ''; // start with an empty substring variable
isExtracted := False; // requested value is not found so far
wasDecSepFound := False; // no DecimalSeparator is found

// are characters in the string?
If Length (aString) > 0 THen
Begin
Repeat
// move to next character position in the string
Inc (StringPos);
// and get it it
aChar := aString[StringPos];

Case aChar of
'0'..'9' : Begin
// was a new number found?
If (SubString = '') Or
(SubString = '-') Or
(SubString = DecimalSeparator) Or
(SubString = ('-' + DecimalSeparator)) THen Inc (ActualNumber);
// attach cipher to substring
SubString := SubString + aChar;
end;

'-' : Begin
// is the already found number requested one?
If ActualNumber = ValueNo THen
Begin
isExtracted := True;
end
Else
// otherwise evaluate minus sign
Begin
//do we care for minus signs?
If IncludeMinus THen
Begin
// minus sign is first symbol for SubString
SubString := '-';
wasDecSepFound := False;
end
Else
// otherwise beginning of a new number
Begin
// reset states and SubString
SubString := '';
wasDecSepFound := False;
end;
end;
end;

'.', ',' : Begin
// do we care for decimal separators?
If IncludeDecSep THen
Begin
// did we already find a decimal separator?
If wasDecSepFound THen
Begin
// terminate when the requested number was found
If ActualNumber = ValueNo THen
Begin
isExtracted := True;
end
Else
// otherwise interpret char as first of a new number
Begin
// attach decimal separator and do not reset the wasDecSepFound
wasDecSepFound := True;
SubString := DecimalSeparator;
end;
end
Else
// first decimal separator in number is attached and presence notified
Begin
wasDecSepFound := True;
SubString := SubString + DecimalSeparator;
end;
end
Else
// decimal separators are delimiters
Begin
// is the already found number the requested one?
If ActualNumber = ValueNo THen
Begin
isExtracted := True;
end
Else
// otherwise reset substring for next try
Begin
SubString := '';
wasDecSepFound := False;
end;
end;
end;

// all other chars are delimiters
Else
// is the already found number the requested one?
If ActualNumber = ValueNo THen
Begin
isExtracted := True;
end
Else
//otherwise reset substring for next try
Begin
SubString := '';
wasDecSepFound := False;
end;
end;

Until isExtracted Or (StringPos >= Length (aString));
end;

If ActualNumber = ValueNo THen Result := SubString;
end;

Examples

Observing the variables while parsing the second line of the fictive example above shown in the Objective section above demonstrates internal functioning of the state machine. The state of the variables is displayed after the actual symbol is processed. This point is marked with an asterisk in Figure 1.

Example 1: A call of the function with:

  • aString = ’0.5 –1324 00101010 TRUE 0.213’
  • ValueNo = 1
  • IncludeMinus = TRUE
  • IncludeDecSep = TRUE

results stepwise in:

aChar

StringPos

ActualNumber

isExtracted

wasDecSepFound

SubString

0

1

1

FALSE

FALSE

0

.

2

1

FALSE

TRUE

0.

5

3

1

FALSE

TRUE

0.5

4

1

TRUE

TRUE

0.5

After the fourth step a symbol that does not belong to a number causes evaluation of the boolean expression (ActualNumber = ValueNo). As the outcome is TRUE, isExtracted is set the content of SubString is not discarded and causes the state machine to halt. Thereafter, the same evaluation returns content of substring as result.

 

Example 2: Calling the function with the parameters:

  • aString = ’0.5 –1324 00101010 TRUE 0.213’
  • ValueNo = 4
  • IncludeMinus = True
  • IncludeDecSep = True

successively results in:

aChar

StringPos

ActualNumber

isExtracted

wasDecSepFound

SubString

0

1

1

FALSE

False

0

.

2

1

FALSE

TRUE

0.

5

3

1

FALSE

TRUE

0.5

4

1

FALSE

False

-

5

1

FALSE

False

-

1

6

2

FALSE

False

-1

3

7

2

FALSE

False

-13

2

8

2

FALSE

False

-132

4

9

2

FALSE

False

-1324

10

2

FALSE

False

0

11

3

FALSE

False

0

0

12

3

FALSE

False

00

1

13

3

FALSE

False

001

0

14

3

FALSE

False

0010

1

15

3

FALSE

False

00101

0

16

3

FALSE

False

001010

1

17

3

FALSE

False

0010101

0

18

3

FALSE

False

00101010

19

3

FALSE

False

T

20

3

FALSE

False

R

21

3

FALSE

False

U

22

3

FALSE

False

E

23

3

FALSE

False

24

3

FALSE

False

0

25

4

FALSE

False

0

.

26

4

FALSE

True

0.

2

27

4

FALSE

True

0.2

1

28

4

FALSE

True

0.21

3

29

4

FALSE

True

0.213

When StringPos is as high as the number of symbols in aString the algorithm terminates the loop. As already mentioned isExtracted is not set to TRUE at this point. Consequently, the boolean expression (ActualNumber = ValueNo) decides whether the content of SubString is returned as result.

 

Homework

The described algorithm works correct under the terms described above. However, the critical reader may be interested in analysing the above explanations in more detail. He will find two sites where state variables are modified although not necessary. Additionally, the code may show some redundancies.

 
Last Updated on Friday, 18 March 2011 18:26