Introduction

This post describes an interesting issue I ran into while using Python’s grequests module.

Suppose you have to download 100 pages of a site from URLs of the form https://site?page=n, where n = 1 to 100. In Python, a plain loop that requests each page in turn is very slow because it downloads the pages one by one.
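A minimal sketch of such a one-by-one download loop follows. The function name fetch_pages_sequentially and its injectable fetch parameter are illustrative assumptions, not code from the original post; by default it falls back to requests.get.

```python
def fetch_pages_sequentially(urls, fetch=None):
    """Download each URL in turn; every call blocks until its response arrives."""
    if fetch is None:
        # Assumption: the third-party `requests` package is installed.
        import requests
        fetch = requests.get

    responses = []
    for url in urls:
        # The loop cannot move on to the next URL until this call returns.
        responses.append(fetch(url))
    return responses


# URLs of the form used in the text: https://site?page=n for n = 1..100.
PAGE_URLS = ["https://site?page={}".format(n) for n in range(1, 101)]
```

Calling fetch_pages_sequentially(PAGE_URLS) would take roughly the sum of all 100 response times, since no request starts before the previous one has finished.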

An obvious solution in the Python world is the grequests module, which lets you send the same requests without waiting for each response before starting the next one.
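A hedged sketch of what the grequests version might look like is shown below. The helper names build_page_urls and fetch_pages_concurrently and the pool size of 10 are assumptions made for illustration; only grequests.get and grequests.map come from the library.

```python
def build_page_urls(base_url, first=1, last=100):
    """Return URLs of the form <base_url>?page=n for n in first..last."""
    return ["{}?page={}".format(base_url, n) for n in range(first, last + 1)]


def fetch_pages_concurrently(urls, pool_size=10):
    """Send all requests through grequests, at most pool_size at a time."""
    # Assumption: the third-party `grequests` package is installed. It is
    # imported here so the helper above works without it; in a real script,
    # import grequests before any other network module, because importing it
    # applies gevent's monkey patches.
    import grequests

    pending = (grequests.get(url) for url in urls)
    # map() sends the requests and returns the responses in request order;
    # a failed request yields None in its slot.
    return grequests.map(pending, size=pool_size)
```

A call such as fetch_pages_concurrently(build_page_urls("https://site")) would then replace the sequential loop.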

The grequests version runs faster because:

grequests = gevent + requests

where

gevent = greenlet + libev

At a very high level something like this happens:

When you use requests without gevent:

  1. The client opens a connection to the server.
  2. The client sends the request.
  3. The client waits for the server to respond.
  4. The server responds.
  5. The client goes back to step 1 for the next URL.

When you use grequests:

  1. The client opens a connection to the server.
  2. The client sends the request.
  3. The client does not wait for the server to respond (an IO wait). Instead, it registers a callback that fires when the server responds and returns control to an event loop.
  4. The client repeats steps 1-3 for the rest of the URLs.
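The "register a callback and return to the event loop" idea in step 3 can be sketched with the standard library alone. This is not gevent's implementation: it swaps in the stdlib selectors module and uses local socket pairs in place of real servers, and the names run_event_loop_demo and on_readable are hypothetical.

```python
import selectors
import socket


def run_event_loop_demo(n=3):
    """Register n non-blocking sockets, then let one loop dispatch the replies."""
    sel = selectors.DefaultSelector()
    results = []
    servers = []

    for index in range(n):
        client, server = socket.socketpair()
        client.setblocking(False)

        def on_readable(sock, index=index):
            # Callback fired by the loop once this socket has data to read.
            results.append((index, sock.recv(1024)))
            sel.unregister(sock)
            sock.close()

        # Do not wait for the reply here: register the callback and move on.
        sel.register(client, selectors.EVENT_READ, on_readable)
        servers.append(server)

    # The stand-in "servers" reply in reverse order to show that the client
    # never depended on the order in which it registered its sockets.
    for index, server in reversed(list(enumerate(servers))):
        server.sendall(b"response %d" % index)
        server.close()

    # The event loop: wait for any ready socket and run its callback.
    while len(results) < n:
        for key, _events in sel.select(timeout=1):
            key.data(key.fileobj)

    sel.close()
    return results
```

gevent hides this loop behind ordinary-looking blocking calls, but the underlying shape is the same: a single thread waits on many sockets at once instead of on one at a time.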

How does grequests accomplish this?

  1. grequests imports gevent and uses it to monkey patch the standard library.
  2. After patching, the underlying recv call goes through gevent's non-blocking implementation.

What happens during monkey patching?

  1. Run the following code in your Python 2.7 interpreter console:

        >>> import inspect
        >>> import socket
        >>> inspect.getsourcefile(socket.ssl)
        '/usr/local/Cellar/python@2/2.7.15_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py'

  2. Close and re-open your Python 2.7 interpreter console and run similar code, but after monkey patching through gevent:

        >>> from gevent import monkey
        >>> monkey.patch_all()
        True
        >>> import inspect
        >>> import socket
        >>> inspect.getsourcefile(socket.ssl)
        '/usr/local/lib/python2.7/site-packages/gevent/_socket2.py'

    As you can see, monkey.patch_all ensures that if we reference any modules that gevent patches, we get their gevent variants instead of the vanilla ones.
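The general mechanism can be illustrated without gevent. The sketch below is a stdlib-only illustration of monkey patching, not gevent's code: CountingSocket, patched_socket and demo are hypothetical names, and unlike monkey.patch_all the patch here is deliberately undone afterwards.

```python
import contextlib
import socket


class CountingSocket(socket.socket):
    """Hypothetical stand-in for a patched socket class; counts instances."""

    created = 0

    def __init__(self, *args, **kwargs):
        CountingSocket.created += 1
        super().__init__(*args, **kwargs)


@contextlib.contextmanager
def patched_socket(replacement):
    """Temporarily rebind socket.socket, restoring the original afterwards."""
    original = socket.socket
    socket.socket = replacement
    try:
        yield original
    finally:
        socket.socket = original


def demo():
    with patched_socket(CountingSocket) as original:
        # Code that looks the class up through the module now gets the
        # replacement, without having changed a single line itself.
        sock = socket.socket()
        patched = isinstance(sock, CountingSocket) and socket.socket is not original
        sock.close()
    restored = socket.socket is not CountingSocket
    return patched, restored
```

The same property cuts both ways: any library that later rebinds one of these names again replaces the patched version, whichever patch was applied first.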

To understand gevent in more detail go through:

Given that this is how gevent works, using grequests to fetch 100 URLs should do the trick. But as we will see in the later sections, whether it does depends on the Python modules installed on your system.

Demo time!

Clone this repo which has the sample setup. Run this docker-compose command to get the test environment up and running:

This will create the following containers on your system:

  1. https_server_1 - A local Go HTTPS server that mimics the very useful httpbin.org and provides the /delay endpoint to introduce latency in responses.
  2. python27_n where n = 1 to 4 - Python 2.7 containers with different pip modules installed, which show the situations in which the code is fast or slow.
  3. python37_n where n = 1 to 4 - Same as above, but for Python 3.7.

We will go through a series of stages and observe how grequests behaves in each.

Each stage is going to differ from the previous in the list of modules installed on the docker container. The code that we run in these stages is going to stay the same.

Stage-0

Stage-1

Stage-2

Why is Stage-2 slow and Stage-1 fast with the same code?

Monkey patching and re-patching.

Let's use the existing Stage-1 and Stage-2 Python interpreter consoles to clear this up. Keep in mind that Stage-2 had pyopenssl installed, while Stage-1 didn't. We will demo with Python 2.7, since Python 3.7 gives the same output.

Profiling Stage-1 and Stage-2 code.

  1. Stage-1 Python2.7 profiling output
  2. Stage-2 Python2.7 profiling output
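Profiling output in the format quoted in this section comes from Python's cProfile module. A minimal sketch of collecting it is shown below, assuming a hypothetical helper profile_call that wraps whatever function issues the requests; it is not the exact command used for the stages.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report text).

    The report lists the `top` entries sorted by cumulative time, with the
    columns ncalls, tottime, percall, cumtime, percall and
    filename:lineno(function).
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

Sorting by cumulative time puts the calls that spend the longest waiting, including time spent in their callees, at the top of the report.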

If you observe the output for Stage-2 you will see that there is a call to wait:

30    0.001    0.000   10.042    0.335 /usr/local/lib/python2.7/site-packages/urllib3/util/wait.py:99(do_poll)

The source of /usr/local/lib/python2.7/site-packages/urllib3/util/wait.py:99(do_poll):

def do_poll(t):
    if t is not None:
        t *= 1000
    return poll_obj.poll(t)

return bool(_retry_on_intr(do_poll, timeout))
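The behaviour of this excerpt can be reproduced with the standard library. The sketch below is a simplified stand-in, not urllib3's code: wait_for_read and demo are hypothetical names, the retry-on-interrupt wrapper is omitted, and select.poll is not available on Windows.

```python
import select
import socket


def wait_for_read(sock, timeout=None):
    """Block until sock is readable or timeout (in seconds) expires."""
    poll_obj = select.poll()
    poll_obj.register(sock, select.POLLIN)
    if timeout is not None:
        # poll() expects milliseconds, while callers pass seconds.
        timeout *= 1000
    # poll() returns a list of ready (fd, event) pairs; empty means timeout.
    return bool(poll_obj.poll(timeout))


def demo():
    """Poll a local socket pair before and after data is sent."""
    reader, writer = socket.socketpair()
    try:
        before = wait_for_read(reader, timeout=0.05)  # nothing sent yet
        writer.sendall(b"x")
        after = wait_for_read(reader, timeout=0.05)   # data is waiting
        return before, after
    finally:
        reader.close()
        writer.close()
```

While poll() is waiting, the calling code makes no progress, which is why time spent here shows up directly in the cumulative column of the profile.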

Here poll() blocks until the socket is ready or the timeout expires; the t *= 1000 line only converts the timeout from seconds to the milliseconds that poll() expects. One more interesting observation is that for Stage-2 the recv socket call is made from pyopenssl.py, which leads to this polled waiting at about one second per call:

20/10    0.002    0.000   10.030    1.003 /usr/local/lib/python2.7/site-packages/urllib3/contrib/pyopenssl.py:271(recv)

For Stage-1 the recv socket call is made from gevent, which does not block on IO and moves on to processing the next request:

10/1    0.001    0.000    1.294    1.294 /usr/local/lib/python2.7/site-packages/gevent/_sslgte279.py:448(recv)

Hence there is no polled wait.

Should I just uninstall pyopenssl and be fast again?

You can do that and get the speed gains. But consider why pyOpenSSL exists in the first place, as described in its documentation:

pyOpenSSL was originally created by Martin Sjögren because the SSL support in the standard library in Python 2.1 
(the contemporary version of Python when the pyOpenSSL project was begun) was severely limited. Other OpenSSL wrappers 
for Python at the time were also limited, though in different ways.

In my case, I wanted HTTPS certificate verification (the verify=True flag in requests). The ssl library in recent Python versions supports this, including SNI:

$ python2
>>> import ssl
>>> ssl.HAS_SNI
True
>>> import requests
>>> requests.get('https://google.com', verify=True)
<Response [200]>

But if you are using a Python older than 2.7.9, you have to use pyopenssl for the above features.

The following commands were run in a Python 2.7.8 interpreter without pyopenssl:

$ ~/.pyenv/versions/2.7.8/bin/python
>>> import ssl
>>> ssl.HAS_SNI
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'HAS_SNI'
>>> import requests
>>> requests.get('https://google.com', verify=True)
~/.pyenv/versions/2.7.8/lib/python2.7/site-packages/urllib3/util/ssl_.py:354: SNIMissingWarning: An HTTPS request has been made, but the SNI (Server Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  SNIMissingWarning
~/.pyenv/versions/2.7.8/lib/python2.7/site-packages/urllib3/util/ssl_.py:150: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecurePlatformWarning
~/.pyenv/versions/2.7.8/lib/python2.7/site-packages/urllib3/util/ssl_.py:150: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecurePlatformWarning
<Response [200]>

If we install pyopenssl and retry verification, it goes through:

$ ~/.pyenv/versions/2.7.8/bin/pip install PyOpenSSL
$ ~/.pyenv/versions/2.7.8/bin/python
>>> import ssl
>>> ssl.HAS_SNI
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'HAS_SNI'
>>> import requests
>>> requests.get('https://google.com', verify=True)
<Response [200]>

But what if you need to use other features of pyopenssl related to certificate management which may or may not be in the core ssl library? gevent-openssl has got you covered. It is a gevent wrapper over pyopenssl so that you can use pyopenssl and prevent the re-patching scenario that we ran into.

However, installing gevent-openssl fixes the slowdown with pyopenssl on Python 2.7 but not on Python 3.7. We will look at why in my next post, which explores Stage-3 and Stage-4.

Summary
