Improving Performance with Multi-Threading in Python
Development
Usage of "for" Loop
The for loop is the most commonly used loop statement in Python, and its syntax is as follows:
The Problem
The issue with the for loop is that it executes sequentially, which means that the next loop will only be executed after the code in the for loop has finished executing. If the code inside the for loop takes a long time to execute, then the for loop will take a long time to execute as well.
When web scraping, we often encounter situations where we need to make multiple requests to an API by changing an index parameter, and then concatenate the data for analysis.
However, network access has always been a bottleneck in our program execution speed. If we use a for loop to get data, we will waste a lot of time.
Enabling Multi-threading
Fortunately, Python provides support for multi-threading, which allows us to open multiple threads and simultaneously access multiple URLs for data retrieval, thereby saving a lot of time. Using the following syntax, we can perfectly replace the for loop.
Principles and Considerations for Speeding up with Multi-threading
It should be noted that the reason why multi-threading can improve the execution efficiency in this example is not because we are squeezing the performance of our computer, but because the CPU is actually idle when downloading data, and it is waiting for the download task to complete before executing the contents of the next loop (starting the next download). In short, what slows down our execution speed is the network, not the CPU. Therefore, if your task is mainly local computing tasks, using multi-threading may not garanteed to improve your execution speed.
In addition, making a huge amount of requests to an API at the same time is not a good idea. We can control the maximum number of threads to avoid putting too much pressure on the internet server.
Tags:
Previous
Next