Normalizing Word Spacing Within a Text String

A program often needs to read data from a text file and collect all the words or text elements into an array. This is usually done by splitting the text at the spaces that separate the words. Normally, words are separated by a single space. However, the spacing is not always uniform: some words may be separated by more than one space.

Splitting on a single space does not separate the words properly unless every pair of words or text elements is separated by exactly one space, as the PHP example below demonstrates. Note that the example text string has uneven spacing between the words.

<?php

   $t = 'This  is an   example text line.';

// Collect all the words into the work array and count them.
   $wArray = preg_split('/ /', $t);
   $wCount = count($wArray);

// Print a listing of all words in the work array.
   for($i=0;   $i < $wCount;   $i++)  {print $i.' '.$wArray[$i].'<br>';}

   print "<br>Number of words = $wCount<br>";
   
?>

Executing the code above prints the following output, which is inaccurate because of the uneven spacing between words. There are only 6 words, but each extra space produced an empty string that was also stored and counted as a word.

0 This
1 
2 is
3 an
4 
5 
6 example
7 text
8 line.

Number of words = 9
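
The blank entries in the listing are empty strings, not spaces: each additional space between two words yields one empty element. As a minimal sketch (the sample string and variable names here are illustrative and not part of the original example), the snippet below splits a short string and lists the positions of the empty elements.

```php
<?php

// Hypothetical sample string: two spaces after 'one', three after 'two'.
$sample = 'one  two   three';

// Split at every single space, as in the example above.
$parts = preg_split('/ /', $sample);

// array_keys() with a search value returns the keys of matching elements.
// The third argument (true) requests strict comparison.
$emptyPositions = array_keys($parts, '', true);

print 'Elements: ' . count($parts) . "\n";
print 'Empty elements at positions: ' . implode(', ', $emptyPositions) . "\n";
```

Here the split produces 6 elements, with empty strings at positions 1, 3 and 4, although the string contains only 3 words.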

To account for the uneven spacing, we need to strip any leading and trailing spaces and reduce each run of extra white space to a single space, which ensures there is only one space between words before they are split into an array. The code below demonstrates such a filter.

<?php

   $t = 'This  is an   example text line.';

// Filter out all leading and trailing spaces and reduce  
// multiple spaces between words to single spaces.
   $t = preg_replace('/\s+/', ' ', trim($t));

// Collect all the words into the work array and count them.
   $wArray = preg_split('/ /', $t);
   $wCount = count($wArray);

// Print a listing of all words in the work array.
   for($i=0;   $i < $wCount;   $i++)   {print $i.' '.$wArray[$i].'<br>';}

   print "<br>Number of words = $wCount<br>";

?>

Running the code prints the output below, which is correct. There are 6 words and no blank entries in the array.

0 This
1 is
2 an
3 example
4 text
5 line.

Number of words = 6
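
As an alternative sketch, not the method used on this page, the filtering and the splitting can be combined into one call: preg_split() accepts the PREG_SPLIT_NO_EMPTY flag, which drops empty elements from the result. The function name splitWords below is hypothetical.

```php
<?php

// Hypothetical helper: split a string into words in a single step.
function splitWords(string $text): array
{
    // '/\s+/' matches one or more white-space characters.
    // A limit of -1 means "no limit"; PREG_SPLIT_NO_EMPTY discards
    // the empty strings caused by leading or trailing white space.
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$words = splitWords('  This  is an   example text line.  ');

foreach ($words as $i => $word) {
    print $i . ' ' . $word . "\n";
}

print 'Number of words = ' . count($words) . "\n";
```

This variant leaves the original string untouched, which may be preferable when only the word array is needed. The two-step filter remains useful when the cleaned string itself is wanted.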
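
The filter also removes leading and trailing spaces. The example below applies it to a string padded with extra spaces at both ends and prints the result in quotes so that the edges are visible.

<?php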

   $t = '   This   is an   example        text   line.  '; 
   
   $t = preg_replace('/\s+/', ' ', trim($t));

   print "'$t'";
?>

Running the code should print the following filtered output without any extra spaces.

'This is an example text line.'
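
Because \s matches any white-space character, the same filter also collapses tabs and line breaks, which are common in data read from a text file. A minimal sketch, wrapping the filter in a hypothetical function named normalizeSpacing (the name and the sample string are assumptions for illustration):

```php
<?php

// Hypothetical helper wrapping the filter shown above.
function normalizeSpacing(string $text): string
{
    // trim() removes leading and trailing white space;
    // preg_replace() then reduces each remaining run of white space
    // (spaces, tabs, line breaks) to a single space.
    return preg_replace('/\s+/', ' ', trim($text));
}

// Tabs (\t) and line breaks (\n) mixed with ordinary spaces.
$raw = "\tThis  is\tan \n example   text line.\n";

print "'" . normalizeSpacing($raw) . "'\n";
```

For this input the function should return the same normalized line as above, so the result can be passed straight to the splitting step.
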
normalizing-word-spacing.txt · Last modified: 2023/01/06 11:17 by jaywiki

Except where otherwise noted, content on this wiki is licensed under the following license: Public Domain