Parallel Word Generator and Word Counter

Exploring Concurrency in Daily Life and Parallel Programming: Word Generation and Counting with MPI and Modified MapReduce

Published: March 18, 2022

Last updated: March 19, 2024

Summary

Concurrency is something we practice frequently in daily life: for example, holding a water bottle in one hand while opening a door with the other. Parallel programming is simply the realization of such concurrent operations on a computer. This paper discusses the development of a Word Generator using the MPI library, and of a Word Counter built by adapting the MapReduce algorithm to my own needs. The evaluation shows that beyond a certain point parallelization becomes detrimental to runtime, which is consistent with Amdahl's law.

1. Introduction

Concurrency is something we practice frequently in daily life: for example, holding a water bottle in one hand while opening a door with the other. Parallel programming is simply the realization of such concurrent operations on a computer. To perform concurrent operations, you first need a CPU with an appropriate architecture; most processors today are capable of them. The processor in your home computer uses the MIMD architecture shown in Figure 1.1.

MIMD
Figure 1.1 MIMD (Wikipedia, 2022)

2. MPI

MPI (Message Passing Interface) is a communication protocol for parallel computers. It is used for communication between nodes running parallel programs on distributed-memory systems, and it supports both point-to-point and collective communication. MPI defines an application programming interface for message passing, together with protocol and semantic specifications of how its features must behave in any implementation. Its goals are high performance, scalability, and portability. Despite not being approved by any official standards body, MPI has become a de facto standard in high-performance computing (Wikipedia, 2022).

3. MapReduce

MapReduce is a programming model and associated implementation for processing and generating large datasets distributed across a cluster. MapReduce is an algorithm for distributing data and then reducing the results back together. Google also provides a MapReduce framework for Big Data under the same name; however, this paper does not use that framework. The MapReduce algorithm here has been adapted and optimized entirely to the needs of the tool. Figure 3.1 roughly illustrates a MapReduce algorithm used to count words within a document. The visualization involves no parallel processing, which shows that MapReduce is not used exclusively for parallel workloads.

MapReduce pseudo code
Figure 3.1 MapReduce pseudo code (Wikipedia, 2022)
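The idea in Figure 3.1 can be sketched as a minimal, sequential MapReduce-style word count in Go. The function names `mapPhase` and `reducePhase` are illustrative, not taken from the tool:

```go
package main

import (
	"fmt"
	"strings"
)

// mapPhase emits a (word, "1") pair for every word in the input,
// mirroring the classic MapReduce word-count map step.
func mapPhase(text string) [][2]string {
	var pairs [][2]string
	for _, w := range strings.Fields(text) {
		pairs = append(pairs, [2]string{w, "1"})
	}
	return pairs
}

// reducePhase sums the emitted pairs per word; since each pair
// carries an implicit count of 1, incrementing is summing.
func reducePhase(pairs [][2]string) map[string]int {
	counts := make(map[string]int)
	for _, p := range pairs {
		counts[p[0]]++
	}
	return counts
}

func main() {
	counts := reducePhase(mapPhase("the quick the lazy the"))
	fmt.Println(counts["the"], counts["quick"]) // 3 1
}
```

As the figure suggests, nothing here is parallel; parallelism enters only when multiple map workers run at once, as in the Word Counter below.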

4. Word Generator

The word generator is a tool that creates words by concatenating characters selected at random from the 26 letters of the English alphabet, according to the number of processes and the minimum and maximum word-length constraints given by the user. These words are then saved to a file, separated by spaces. The pseudo code for the word generator can be seen below.


MPI_Init()
MPI_Comm_rank()
MPI_Comm_size(N)

create a new Generator
if world_rank == 0:
    Calculate the number of tasks to be sent to each process
    Calculate how much data will be received from each process for scatter
    Create a receiver buffer for scatter

distribute the generated data
wait until all data is distributed at the barrier

Create a character buffer to hold words
Generate words and assign them to the character buffer

if world_rank == 0:
    Create a buffer to collect all words

Collect data from all processes into the buffer
Wait at the barrier until all data is collected

if world_rank == 0: write all data to a file
    Print the time taken for execution

MPI_Finalize()

Release the buffer you created

The word generator is implemented using the MPI library in C++. Roughly, the process involves dividing the desired number of words by the number of processes and transmitting the number of words each process needs to generate via buffers. After each process produces its share of words, all words are gathered within the root process. The gathered words are then saved to a file with spaces between them. The file structure of the word generator tool is illustrated in Figure 4.2.

Word Generator file structure
Figure 4.2 Word Generator file structure

  • The /bin directory contains the generated executable files.
  • The /doc directory contains the necessary documents.
  • The /include directory contains the header files.
  • The /lib directory contains the compiled source code, packaged as a library.
  • The /results directory contains the generated datasets.
  • The /src directory contains the source files implementing the headers.
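The work division described in this section (dividing the desired word count by the number of processes) can be sketched as below. How the tool handles a remainder is not stated, so spreading it over the first ranks is my assumption, and Go is used purely for illustration:

```go
package main

import "fmt"

// tasksPerProcess splits totalWords across worldSize processes,
// spreading any remainder over the first ranks so that the
// per-process counts differ by at most one.
func tasksPerProcess(totalWords, worldSize int) []int {
	base, rem := totalWords/worldSize, totalWords%worldSize
	counts := make([]int, worldSize)
	for rank := range counts {
		counts[rank] = base
		if rank < rem {
			counts[rank]++ // first `rem` ranks take one extra word
		}
	}
	return counts
}

func main() {
	fmt.Println(tasksPerProcess(10, 3)) // [4 3 3]
}
```

In the real tool these per-rank counts are what the root process scatters to the workers before each one generates its share of words.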

5. Word Counter

The word counter counts the occurrences of words in a given file using "n" processes. After counting, it writes the words to a file in YAML format as key-value pairs. The pseudo code for the word counter can be seen below.


create a process queue

function read_from_file:
    buff = 64
    read "buff" bytes of data and extract words from it
    store completely read words in a hashmap
    store incompletely read words in an array

function map:
    determine how much data the process will read
    launch a goroutine and call the "read_from_file" method within it
    enqueue the process into the process queue

function reduce:
    while:
        wait for all processes to finish reading from the queue

    words = {}
    for p in processes:
        store incompletely read words in temporary memory and \
            merge them with those coming from the next process

    for p in processes:
        transfer completely read data in the process to the 'words' variable

map()
reduce()
save words to a file

The Word Counter's MapReduce algorithm is implemented in Go. Each process reads the file in 64-byte buffers, starting where the previous process left off, and this sequence continues. The number of bytes each process reads is determined by the file size: each process reads file_size / n bytes.

The reading process is initiated concurrently with the mapping process. Since the byte offsets at which words are split are not known in advance, a word may be read only partially. A completely read word is stored in the process's map; a partially read word is combined with the partial fragments from the previous or next process to recover the complete word, and this merging is done in the reduce operation.

After the operations are completed, a file name is generated from the current timestamp, and the results are saved to that file in YAML format.

The file structure of the word counter tool is illustrated in Figure 5.2.

Word Counter file structure
Figure 5.2 Word Counter file structure

  • The /bin directory contains the generated executable files.
  • The /doc directory contains the necessary documents.
  • The /results directory contains the generated datasets.
  • The Go package source files live in the repository root.

6. Project Structure

The project consists of two tools: the word counter and the word generator. The word generator creates a file based on specified parameters, while the word counter counts the occurrences of words in the latest file generated by the word generator and creates a new file. All operations are performed in parallel. The file structure is shown in Figure 6.1.

Project file structure
Figure 6.1 Project file structure

  • The /counter directory contains files related to the word counter, including codes and documents.
  • The /generator directory contains files related to the word generator, including codes and documents.
  • The Makefile contains short scripts to facilitate usage.
  • visualization.py contains a script that allows you to perform performance benchmarks on the counter or generator tool within the range of 1-100 processes.

7. Requirements and Installation

The recommended operating system is Ubuntu 20.04. All requirements, installation, and execution processes are explained based on this operating system.

Counter

  • Go -> ^1.18.3

Generator

  • C++ -> ^11
  • OpenMPI -> ^4.0.3

Make

  • make -> ^4.2.1

First, check for updates and perform necessary updates:

sudo apt-get update && sudo apt-get upgrade

C++ should be available on Ubuntu 20.04. You can check it using the following command:

g++ --version

If you encounter an error, you can install it using:

sudo apt-get install build-essential

You can install the MPI library on your computer using the following command:

sudo apt-get install mpich openmpi-bin

After successful installation, you can test it using:

mpic++ --version

Navigate to the /tmp directory and download version 1.18.* to your machine:

cd /tmp && wget https://go.dev/dl/go1.18.3.linux-amd64.tar.gz

After successful download, delete any existing Go files and copy the new one to the /usr/local directory:

sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.18.3.linux-amd64.tar.gz

Once copied, add the Go binary directory to your $PATH. To make the change permanent, add the following line to your shell's rc file ($HOME/.bashrc for Bash, $HOME/.zshrc for Zsh):

export PATH=$PATH:/usr/local/go/bin

You can now test Go usage via the following command:

go version

Make should be installed by default on Ubuntu 20.04. You can check it using the following command:

make --version

If it's not installed, you can do so using:

sudo apt-get install make

If all steps were completed without errors, the installations are successful, and you can proceed to execution.

8. Makefile

To facilitate execution, a Makefile has been created in the project's root directory. With the Makefile, both tools can be executed using environment variables and make targets, without the need to write complex commands. The Makefile provides a CLI command set for easy usage. Available variables:

  • WORLD_SIZE: min=1, max=100, default=1
  • DATASET_SIZE: min=1, max=1M, default=50K
  • MAX_STR_LEN: min=2, max=100, default=10
  • MIN_STR_LEN: min=1, max=100, default=2
  • GENERATED_FILE_PATH: Optional, must be a valid file path

WORLD_SIZE

Specifies the number of processes to be used. Applicable to both Generator and Counter. Constraints: 1 <= X <= 100, default=1

  • Must not be larger than DATASET_SIZE.

DATASET_SIZE

Specifies the number of words to be generated. A variable used only by the Generator. Constraints: 1 <= X <= 1000000, default=50000

  • Must not be smaller than WORLD_SIZE.

MAX_STR_LEN

Specifies the maximum length of the generated words. A variable used only by the Generator. Constraints: 2 <= X <= 100, default=10

  • Must not be smaller than MIN_STR_LEN.

MIN_STR_LEN

Specifies the minimum length of the generated words. A variable used only by the Generator. Constraints: 1 <= X <= 100, default=2

  • Must not be larger than MAX_STR_LEN.

GENERATED_FILE_PATH

Can be used to specify the path of the file whose words should be counted. Optional; a variable used only by the Counter. Constraints: must be a valid file path, default=the last file generated by the Generator.

9. Execution

If you want to compile both applications to their final version and run them with default values, you can execute the following command in the root directory of the repository:

make

To run applications with custom values:

make MAX_STR_LEN=15 MIN_STR_LEN=5 WORLD_SIZE=10 DATASET_SIZE=20

To run applications with custom values (assuming they have been previously compiled):

make runner MAX_STR_LEN=15 MIN_STR_LEN=5 WORLD_SIZE=10 DATASET_SIZE=20

To compile applications only:

make builder

To compile only the Generator application:

make build_generator

To compile only the Counter application:

make build_counter

To run only the Generator application (assuming it has been previously compiled):

make run_generator MAX_STR_LEN=15 MIN_STR_LEN=5 WORLD_SIZE=10 DATASET_SIZE=20

To compile and run only the Generator application:

make BR_generator MAX_STR_LEN=15 MIN_STR_LEN=5 WORLD_SIZE=10 DATASET_SIZE=20

To run only the Counter application (assuming it has been previously compiled):

make run_counter WORLD_SIZE=10

To run the Counter application with a specified file path (assuming it has been previously compiled):

make run_counter WORLD_SIZE=5 GENERATED_FILE_PATH=$HOME/MPI-words/generator/results/2022_25_05-17_42_11.txt

To compile and run only the Counter application:

make BR_counter WORLD_SIZE=10

10. Evaluation

Device Information:

  • OS: Ubuntu 20.04.4 LTS x86_64
  • Kernel: 5.13.0-44-generic
  • Shell: zsh 5.8
  • Terminal: gnome-terminal
  • CPU: 11th Gen Intel i5-11400H
  • Memory: 16GB DDR4 3200 MHz

Measurements vary depending on the differences between the C++ and Go languages, the parameters passed to the executables, and the state of my computer at the time of each measurement.

Word Generator

Figure 10.1 shows the runtime of the generator producing 1,000,000 randomly generated words of 100 characters each, with the number of processes varying from 1 to 100.

Word Generator Performance
Figure 10.1 Word Generator Performance

As a result, with these parameters, parallelizing beyond 17 processes becomes detrimental to runtime.
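This matches Amdahl's law: if p is the fraction of the work that can be parallelized and n is the number of processes, the best possible speedup is

S(n) = 1 / ((1 - p) + p / n)

which is bounded above by 1 / (1 - p) no matter how large n grows. Once the fixed serial fraction and the communication overhead of additional processes outweigh the shrinking parallel share, adding processes only increases total runtime.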

Word Counter

Figure 10.2 shows the runtime of the counter tallying the words in a file of 1,000,000 words of 100 characters each and saving the results in YAML format, with the number of processes ranging from 1 to 100.

Word Counter Performance
Figure 10.2 Word Counter Performance

Unlike the generator graph, there is no steady trend. With these parameters, it can be seen that runtime begins to degrade once more than 7 processes are used.

References

  • Wikipedia (2022). MIMD.
  • Wikipedia (2022). Message Passing Interface.
  • Wikipedia (2022). MapReduce.

DISCLAIMER

This document was translated from Turkish to English using ChatGPT. The original document was written in Turkish by the author. The translation may contain errors and inaccuracies.

Categories

Parallel Programming, MapReduce, Concurrency, Algorithms, MPI, C++, Go, Bursa Technical University, 2022