• 12
name

A PHP Error was encountered

Severity: Notice

Message: Undefined index: userid

Filename: views/question.php

Line Number: 191

Backtrace:

File: /home/prodcxja/public_html/questions/application/views/question.php
Line: 191
Function: _error_handler

File: /home/prodcxja/public_html/questions/application/controllers/Questions.php
Line: 433
Function: view

File: /home/prodcxja/public_html/questions/index.php
Line: 315
Function: require_once

name Punditsdkoslkdosdkoskdo

How to use Python requests to fake a browser visit?

I want to get the content from the below website. If I use a browser like Firefox or Chrome I could get the real website page I want, but if I use the Python requests package (or wget command) to get it, it returns a totally different HTML page. I thought the developer of the website had made some blocks for this, so the question is:

How do I fake a browser visit by using python requests or command wget?

http://www.ichangtou.com/#company:data_000008.html

Provide a User-Agent header:

import requests

url = 'http://www.ichangtou.com/#company:data_000008.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
print(response.content)

FYI, here is a list of User-Agent strings for different browsers:


As a side note, there is a pretty useful third-party package called fake-useragent that provides a nice abstraction layer over user agents:

fake-useragent

Up to date simple useragent faker with real world database

Demo:

>>> from fake_useragent import UserAgent
>>> ua = UserAgent()
>>> ua.chrome
u'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
>>> ua.random
u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'
  • 278
Reply Report
    • thanks for your answer, I tried with the headers in my requests but still could not get the real content of the page, there's a string 'Your web browser must have JavaScript enabled in order for this application to display correctly.' in the returned html page, should I add java script support in the requests? If so how would I do that?
      • 2
    • @user1726366: You can't simply add JavaScript support - you need a JavaScript interpreter for that. The simplest approach is to use the JavaScript interpreter of a real Web browser, but you can automate that from Python using Selenium.
    • @alecxe,@sputnick: I tried to capture the packets with wireshark to compare the difference from using python requests and browser, seems like the website url isn't a static one I have to wait for the page render to complete, so Selenium sounds the right tools for me. Thank you for your kind help. :)
      • 1
    • @user1726366 yup, if using a real browser+selenium fits your needs then this is the most painless approach. Note that you can use PhantomJS headless browser with selenium. Thanks. (don't forget to accept the answer if it was helpful)

if this question is still valid

I used fake UserAgent

How to use:

from fake_useragent import UserAgent
import requests


ua = UserAgent()
print(ua.chrome)
header = {'User-Agent':str(ua.chrome)}
print(header)
url = "https://www.hybrid-analysis.com/recent-submissions?filter=file&sort=^timestamp"
htmlContent = requests.get(url, headers=header)
print(htmlContent)

outPut:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17
{'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
<Response [200]>
  • 30
Reply Report
      • 1
    • Could you please ping the link here? I can try at my end. Further if IP is blocked then error code should be 403(forbidden) or 401(unauthorised). There are websites which do not allow scraping at all. Further many websites user cloudflare to avoid bots to access website.
      • 1
    • Here is my link regalbloodline.com/music/eminem. It worked fine before. Stopped working on python 2. Worked on python 3 on local machine. Move to AWS EC2 did not work there. Kept getting Error 404. Then stopped working on local machine too. Using browser emulation worked on local machine but not on EC2. In the end I gave up and found alternative website to scrape. By the way is cloudfire could be avoided ?

Try doing this, using firefox as fake user agent (moreover, it's a good startup script for web scraping with the use of cookies):

#!/usr/bin/env python2
# -*- coding: utf8 -*-
# vim:ts=4:sw=4


import cookielib, urllib2, sys

def doIt(uri):
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    page = opener.open(uri)
    page.addheaders = [('User-agent', 'Mozilla/5.0')]
    print page.read()

for i in sys.argv[1:]:
    doIt(i)

USAGE:

python script.py "http://www.ichangtou.com/#company:data_000008.html"
  • 7
Reply Report

The root of the answer is that the person asking the question needs to have a JavaScript interpreter to get what they are after. What I have found is I am able to get all of the information I wanted on a website in json before it was interpreted by JavaScript. This has saved me a ton of time in what would be parsing html hoping each webpage is in the same format.

So when you get a response from a website using requests really look at the html/text because you might find the javascripts JSON in the footer ready to be parsed.

  • 3
Reply Report

Trending Tags