How to implement Multithreading in Python

Now and then in your programming career, you may hear people throwing around terms like parallel threading, parallel processing, and multithreading. Apart from being a mouthful, these terms can make you feel like you are building something very complex. They can be intimidating, indeed! But, as is true for any complex concept, the easiest way to learn it is to break it down, understand it, and practice it. It’s that simple!

In this guide, I will explain the concepts of multithreading and multiprocessing and give you a step-by-step implementation of multithreading, with a simple example that you can replicate in your own Python projects. Let’s dive right in!

What is a thread and what is a process?

Let’s avoid technical jargon and use a simple real-world analogy. Let’s say you are driving a car on a nice scenic highway. Now, if you call the act of driving a process, then switching gears, hitting the brakes, honking, etc. are threads of that process. A thread is essentially a subset of a process, whereas a process is a running instance of a program. In our analogy, the highway is our operating system. A program can have multiple instances (processes) running simultaneously: there can be many cars running on a highway! Similarly, a process can have multiple threads starting and terminating throughout its lifecycle.

Essentially, a thread is a logical segment of a process that’s responsible for completing a specific subset of tasks. This explanation should suffice for our purpose, but if you want to go deeper, difference-between-process-and-thread provides a nice drill-down of the concept.
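To make the analogy a little more concrete, here is a minimal sketch showing that several threads live inside one and the same process. The function names describe_worker and run_workers are made up for this illustration; only os.getpid and the threading module come from Python itself.

```python
import os
import threading

def describe_worker(results, index):
    # Record which process and which thread ran this function.
    results[index] = (os.getpid(), threading.current_thread().name)

def run_workers(count=3):
    # Hypothetical helper: start `count` threads and collect what each one saw.
    results = [None] * count
    threads = [
        threading.Thread(
            target=describe_worker,
            args=(results, i),
            name="worker-{}".format(i),
        )
        for i in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    for pid, thread_name in run_workers():
        print("process id:", pid, "| thread:", thread_name)
```

Every line printed should show the same process id next to a different thread name: one car, several things happening inside it.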

But why use multithreading?

Wouldn’t it be better if you had twenty hands instead of just two?

Multithreading lets you execute parts of your program concurrently to improve performance and make fuller use of your resources. In Python, it helps most when the work spends much of its time waiting on I/O, such as reading and writing files. If you can locate the segments of your program that can be executed concurrently, use multithreading. But remember, not everything can be parallelized: some things are meant to be done sequentially. You shouldn’t hit the gas and the brakes at the same time, lol!
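Here is a small sketch of that idea, under one loud assumption: time.sleep stands in for a real I/O wait such as a disk or network call. The helper names (fake_io_task, run_sequentially, run_with_threads) are invented for this illustration.

```python
import time
from threading import Thread

def fake_io_task(delay):
    # ASSUMPTION: sleeping simulates waiting on a disk or network call.
    time.sleep(delay)

def run_sequentially(task_count, delay):
    # Run the tasks one after another and return the elapsed seconds.
    start = time.perf_counter()
    for _ in range(task_count):
        fake_io_task(delay)
    return time.perf_counter() - start

def run_with_threads(task_count, delay):
    # Run each task in its own thread and return the elapsed seconds.
    start = time.perf_counter()
    threads = [Thread(target=fake_io_task, args=(delay,)) for _ in range(task_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    print("sequential:", round(run_sequentially(5, 0.2), 2), "seconds")
    print("threaded  :", round(run_with_threads(5, 0.2), 2), "seconds")
```

The sequential run waits for each task in turn, while the threaded run lets the waits overlap, so it should finish in roughly the time of a single task. Exact numbers will vary from machine to machine.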

Although multithreading and multiprocessing seem obviously beneficial in theory, you may encounter cases where using multithreading is not a good idea. Every thread costs some memory and scheduling overhead, so try to find the right number of parallel threads for the system load and resource capacity you have. The best way to find it is through trial and error and experimentation.
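One common way to experiment with the number of threads is a thread pool with an upper limit. The sketch below uses ThreadPoolExecutor from Python’s standard concurrent.futures module, which is a different technique from the plain Thread objects used in the rest of this guide. The function name process_all and the default of 8 workers are assumptions made for illustration, not recommendations.

```python
from concurrent.futures import ThreadPoolExecutor

def process_all(items, worker, max_workers=8):
    # The pool never runs more than `max_workers` threads at a time;
    # remaining items wait in a queue until a thread becomes free.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `items`.
        return list(pool.map(worker, items))

if __name__ == "__main__":
    # Try different max_workers values and time each run on your own machine.
    print(process_all([1, 2, 3, 4], lambda n: n * n, max_workers=2))
```

Changing max_workers and timing each run is an easy way to do the trial and error described above.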

Multithreading in Python with an example

As promised, let’s talk about how to implement multithreading in your Python project. We’ll set up a working environment and write a Python module that performs certain I/O operations and then we’ll see how multithreading can help improve the performance, live!

Install Python

If you are trying to understand how multithreading works, I suppose you already have a fair idea of how to install Python, and you may already have it installed on your machine. But I want to be as thorough as possible, so if you need help with the Python installation, fear not: our guide on Getting Started with Python will set you up real quick.

Create a virtual environment

We’ll be creating a Python virtual environment with the built-in venv module for this exercise. Virtual environments give you an isolated space to run your experiments in, and you should always do your experiments safely. Obviously, there is more to virtual environments, and you can learn more about them at Conda Virtual Environments – Create, Manage, Deploy. But for now, you can just follow the instructions below.

To create the virtual environment, run the following command in a terminal of your choice:

python -m venv multi-threading

The above command will create a virtual environment named multi-threading. Let’s activate it.

# For Linux and macOS:
source multi-threading/bin/activate

# For Windows CMD prompt:
multi-threading\Scripts\activate.bat

# For Windows PowerShell:
multi-threading\Scripts\Activate.ps1

You should see your terminal prompt change to show the environment name, typically as a (multi-threading) prefix.

[Screenshot: activating the virtual environment in a terminal]

Example: Single Threaded Python Module

Let’s write a simple Python module that performs the following I/O operations using a single thread.

  1. Read text files available in a source directory one by one
  2. Count the number of lines in each of them
  3. Add the line count at the end of the file
  4. Write the file with line count to a target directory

Single threaded sample program

import os
import logging
from datetime import datetime

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

def append_line_count(file, num_of_lines):
    f = open(file, 'r')
    data = f.readlines()
    f.close()
    data.append("\nLINE_COUNT|" + str(num_of_lines))
    return data

def write_file(file, data):
    f = open(file, 'w')
    for line in data:
        f.write(str(line))
    f.close()

def process_file(file_nm, source_dir, target_dir):
    logging.info("Processing file : {}".format(file_nm))

    abs_src_file_nm = os.path.join(source_dir, file_nm)
    abs_tgt_file_nm = os.path.join(target_dir, file_nm)

    # Count the lines and close the file as soon as the count is done
    with open(abs_src_file_nm) as src:
        num_of_lines = sum(1 for _ in src)
    data = append_line_count(abs_src_file_nm, num_of_lines)
    write_file(abs_tgt_file_nm, data)

    logging.info("File Processed : {}".format(file_nm))

def main():
    # UPDATE BELOW TWO VARIABLES AS PER YOUR DIRECTORY STRUCTURE
    source_dir = "C:\\ExitCondition\\MultiThreading\\SourceFiles"
    target_dir = "C:\\ExitCondition\\MultiThreading\\TargetFiles"

    logging.info("Extracting source files list from Source Directory {}".format(source_dir))
    source_files = os.listdir(source_dir)

    for file_nm in source_files:
        process_file(file_nm, source_dir, target_dir)

if __name__ == "__main__":
    start_time = datetime.now()
    main()
    run_time = datetime.now() - start_time
    logging.info("Total Execution time : {}".format(run_time))

This is a relatively simple requirement, and the code is not written optimally, but it will suffice for us to understand how we can utilize multithreading in it.

Save the above program into a Python file called single-thread-line-count.py. Remember to update the source_dir and target_dir variables in the script.

Create test data files

To run this program, we’ll need to create source and target directories, and place a large number of small text files in the source directory. Don’t worry if you don’t have a large number of test files with you. We’ll create them programmatically.

Create one test file and name it source-file-1.txt. You can put any kind of content in it. Once created, copy it into the source directory, open a bash terminal in that directory, and run the for loop below.

for ((i=2; i<=100000; i++))
do
  cp source-file-1.txt source-file-$i.txt
done

It will create 99,999 copies of the file, giving you 100,000 files in total. It will take some time, so if you are in a hurry, reduce the 100000 in the loop to 20000.
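If you don’t have a bash terminal handy (on Windows, for example), a rough Python equivalent of the loop above could look like the sketch below. The function name create_test_files is made up for this example, and it assumes source-file-1.txt already exists in the source directory.

```python
import os
import shutil

def create_test_files(source_dir, total_files, seed_name="source-file-1.txt"):
    # ASSUMPTION: the seed file already exists inside source_dir.
    seed_path = os.path.join(source_dir, seed_name)
    copies_made = 0
    # Start at 2 because source-file-1.txt is the seed itself.
    for i in range(2, total_files + 1):
        target_path = os.path.join(source_dir, "source-file-{}.txt".format(i))
        shutil.copyfile(seed_path, target_path)
        copies_made += 1
    return copies_made
```

Calling create_test_files(source_dir, 100000) with your own source directory path should leave you with 100,000 files, just like the bash loop.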

Execute the single threaded program

Once we have 100,000 files in the Source directory, we are all set to run our test. Run the following command on your terminal to execute our Python script.

# python <name of the python module>
python single-thread-line-count.py
[Screenshot: single-threaded Python script execution]

It took around 1 minute and 55 seconds to process all the files, which is not very attractive performance considering it’s just counting the number of lines in each file and writing the result to the target directory. The execution time may differ for you, as your file sizes and machine specs might be different from mine.

Example: Multi-Threaded Python Module

Now that we know how much time it took for a single-threaded program to finish a simple task, let’s see if we can improve it by using multithreading.

We will import the Thread class from Python’s built-in threading module, create a thread to process each file, and start them.

Multi-threaded sample program

import os
import logging
from datetime import datetime
from threading import Thread

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

def append_line_count(file, num_of_lines):
    f = open(file, 'r')
    data = f.readlines()
    f.close()
    data.append("\nLINE_COUNT|" + str(num_of_lines))
    return data

def write_file(file, data):
    f = open(file, 'w')
    for line in data:
        f.write(str(line))
    f.close()

def process_file(file_nm, source_dir, target_dir):
    logging.info("Processing file : {}".format(file_nm))

    abs_src_file_nm = os.path.join(source_dir, file_nm)
    abs_tgt_file_nm = os.path.join(target_dir, file_nm)

    # Count the lines and close the file as soon as the count is done
    with open(abs_src_file_nm) as src:
        num_of_lines = sum(1 for _ in src)
    data = append_line_count(abs_src_file_nm, num_of_lines)
    write_file(abs_tgt_file_nm, data)

    logging.info("File Processed : {}".format(file_nm))

def main():
    # UPDATE BELOW TWO VARIABLES AS PER YOUR DIRECTORY STRUCTURE
    source_dir = "C:\\ExitCondition\\MultiThreading\\SourceFiles"
    target_dir = "C:\\ExitCondition\\MultiThreading\\TargetFiles"

    logging.info("Extracting source files list from Source Directory {}".format(source_dir))
    source_files = os.listdir(source_dir)
    thread_list = []

    for file_nm in source_files:
        thread = Thread(target=process_file, args=(file_nm, source_dir, target_dir))
        thread_list.append(thread)

    for thread in thread_list:
        thread.start()

    for thread in thread_list:
        thread.join()

if __name__ == "__main__":
    start_time = datetime.now()
    main()
    run_time = datetime.now() - start_time
    logging.info("Total Execution time : {}".format(run_time))

Save the above program into a Python file called multi-thread-line-count.py. Remember to update the source_dir and target_dir variables in the script.

If you notice, we didn’t change our code that much! It’s that simple to implement multithreading in Python. But it can get complicated if your source code is dealing with complex requirements. Now, let’s understand what we did here.

Understanding multi-threaded program

We have added a few lines to the main function, but the rest of the program is the same. Let’s go through the changes line by line.

thread_list = []

for file_nm in source_files:
    thread = Thread(target=process_file, args=(file_nm, source_dir, target_dir))
    thread_list.append(thread)

First, we create an empty list called thread_list. Then we loop through source_files, create a Thread object for each file, and append it to thread_list. When creating each Thread object, we pass the function we want to call (target) and the arguments that function needs (args). This essentially tells Python which function should run in a separate thread.

Note that our threads won’t start running at this point in the program yet. We have just instantiated thread objects and added them to a list. Let’s look at the next lines.
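You can check this for yourself with Thread.is_alive(). The following is a minimal sketch; the function name show_thread_lifecycle is invented for this illustration, and an Event is used only to keep the thread busy until we are done looking at it.

```python
import threading

def show_thread_lifecycle():
    # The thread will simply wait until we set this event.
    gate = threading.Event()
    thread = threading.Thread(target=gate.wait)

    states = {}
    states["before_start"] = thread.is_alive()  # created, but not running yet

    thread.start()
    states["after_start"] = thread.is_alive()   # now running, blocked on the event

    gate.set()      # let the thread finish
    thread.join()   # wait for it to end
    states["after_join"] = thread.is_alive()    # finished
    return states

if __name__ == "__main__":
    print(show_thread_lifecycle())
    # → {'before_start': False, 'after_start': True, 'after_join': False}
```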

for thread in thread_list:
    thread.start()

This is self-explanatory. We iterate through thread_list and start each thread. This is the point where the process_file function starts executing for each of the source files concurrently.

for thread in thread_list:
    thread.join()

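This is the most important part. Each thread.join() call blocks until that thread has finished, so looping over the whole list makes the main program wait for every thread before moving on. Without this step, the main program would carry on with the next set of instructions while the threads were still running, and the execution time we log at the end would be wrong.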
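To see the effect of join() in isolation, here is a small sketch. The names slow_append and count_finished are hypothetical, and time.sleep is only a stand-in for real work.

```python
import time
from threading import Thread

def slow_append(results, value, delay=0.2):
    # ASSUMPTION: the sleep simulates a task that takes a while.
    time.sleep(delay)
    results.append(value)

def count_finished(use_join):
    results = []
    threads = [Thread(target=slow_append, args=(results, i)) for i in range(3)]
    for t in threads:
        t.start()
    if use_join:
        for t in threads:
            t.join()
    # Without join(), we get here while the threads are most likely still sleeping.
    finished = len(results)
    # Clean up so no thread is left running after the function returns.
    for t in threads:
        t.join()
    return finished

if __name__ == "__main__":
    print("finished without join:", count_finished(use_join=False))
    print("finished with join   :", count_finished(use_join=True))
```

With join(), all three tasks are always counted. Without it, the count is taken too early and will typically be 0, although the exact number depends on timing.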

Execute the multi-threaded program

I hope you still have those 100,000 files in the source directory. If not, place them in the source directory again. Run the following command on your terminal to execute our Python script.

# python <name of the python module>
python multi-thread-line-count.py
[Screenshot: multi-threaded Python script execution]

It took 1 minute and 15 seconds! That’s a 35% performance improvement! That’s something to brag about!
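In case you want to check the math with your own timings, the percentage can be worked out as below. The function name percent_time_saved is just a helper made up for this example; 115 and 75 are the two run times above converted to seconds.

```python
def percent_time_saved(baseline_seconds, improved_seconds):
    # Share of the original run time that was saved, as a percentage.
    return (baseline_seconds - improved_seconds) / baseline_seconds * 100

if __name__ == "__main__":
    single_threaded = 1 * 60 + 55  # 1 minute 55 seconds = 115 seconds
    multi_threaded = 1 * 60 + 15   # 1 minute 15 seconds = 75 seconds
    print(round(percent_time_saved(single_threaded, multi_threaded), 1))
    # → 34.8
```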

Hopefully, you were able to run through the entire exercise seamlessly, and this will help you implement multithreading in your projects. If you run into any issues, please post them in the comments below, and we’ll try to assist you.

There are many advanced options that you can use with multithreading, so do experiment with them. We’ll cover more of multithreading in future posts. Keep watching this space!

As always, Happy Learning!