Python web crawlers
Published: 2019-06-07


1. Simple approach (breadth-first traversal): https://fossbytes.com/how-to-build-a-basic-web-crawler-in-python/

import sys, thread, Queue, re, urllib, urlparse, time, os

dupcheck = set()
q = Queue.Queue(100)
q.put(sys.argv[1])

def queueURLs(html, origLink):
    # Find every <a href="..."> on the page; relative links are made absolute
    # by prepending the scheme and host of the page they were found on.
    for url in re.findall('''<a[^>]+href=["'](.[^"']+)["']''', html, re.I):
        link = url.split("#", 1)[0] if url.startswith("http") else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + url.split("#", 1)[0]
        if link in dupcheck:
            continue
        dupcheck.add(link)
        if len(dupcheck) > 99999:
            dupcheck.clear()
        q.put(link)

def getHTML(link):
    try:
        html = urllib.urlopen(link).read()
        open(str(time.time()) + ".html", "w").write("%s" % link + "\n" + html)
        queueURLs(html, link)
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception:
        pass

# Keep pulling URLs off the queue and fetching each one in a new thread.
# Note: thread, Queue, urllib and urlparse are Python 2 module names.
while True:
    thread.start_new_thread(getHTML, (q.get(),))
    time.sleep(0.5)

Approach: use a queue (Queue) to traverse the link graph breadth-first; the dupcheck set keeps URLs that have already been seen from being enqueued again.
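The snippet above targets Python 2 (thread, Queue, urllib and urlparse no longer exist under those names in Python 3). Below is a minimal single-threaded Python 3 sketch of the same breadth-first idea; the function name crawl_bfs, the seed URL and the max_pages limit are illustrative choices, not part of the linked article.

import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl_bfs(seed, max_pages=100):
    # A FIFO queue gives breadth-first order; the seen set avoids enqueueing duplicates.
    queue = deque([seed])
    seen = {seed}
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # skip pages that cannot be downloaded or decoded
        visited += 1
        print(visited, "visiting:", url)
        # Pull out href values and resolve relative links against the current page.
        for href in re.findall(r'''href=["']([^"'#]+)["']''', html, re.I):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)

if __name__ == "__main__":
    crawl_bfs("https://fossbytes.com/", max_pages=20)  # placeholder seed URL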

 

2. Simple approach, searching for a given word: http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/

from html.parser import HTMLParser
from urllib.request import urlopen
from urllib import parse

# We are going to create a class called LinkParser that inherits some
# methods from HTMLParser which is why it is passed into the definition
class LinkParser(HTMLParser):

    # This is a function that HTMLParser normally has
    # but we are adding some functionality to it
    def handle_starttag(self, tag, attrs):
        # We are looking for the beginning of a link. Links normally look
        # like <a href="http://www.someurl.com">
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    # We are grabbing the new URL. We are also adding the
                    # base URL to it. For example:
                    # www.netinstructions.com is the base and
                    # somepage.html is the new URL (a relative URL)
                    #
                    # We combine a relative URL with the base URL to create
                    # an absolute URL like:
                    # www.netinstructions.com/somepage.html
                    newUrl = parse.urljoin(self.baseUrl, value)
                    # And add it to our collection of links:
                    self.links = self.links + [newUrl]

    # This is a new function that we are creating to get links
    # that our spider() function will call
    def getLinks(self, url):
        self.links = []
        # Remember the base URL which will be important when creating
        # absolute URLs
        self.baseUrl = url
        # Use the urlopen function from the standard Python 3 library
        response = urlopen(url)
        # Make sure that we are looking at HTML and not other things that
        # are floating around on the internet (such as
        # JavaScript files, CSS, or .PDFs for example)
        if response.getheader('Content-Type') == 'text/html':
            htmlBytes = response.read()
            # Note that feed() handles Strings well, but not bytes
            # (A change from Python 2.x to Python 3.x)
            htmlString = htmlBytes.decode("utf-8")
            self.feed(htmlString)
            return htmlString, self.links
        else:
            return "", []

# And finally here is our spider. It takes in an URL, a word to find,
# and the number of pages to search through before giving up
def spider(url, word, maxPages):
    pagesToVisit = [url]
    numberVisited = 0
    foundWord = False
    # The main loop. Create a LinkParser and get all the links on the page.
    # Also search the page for the word or string
    # In our getLinks function we return the web page
    # (this is useful for searching for the word)
    # and we return a set of links from that web page
    # (this is useful for where to go next)
    while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
        numberVisited = numberVisited + 1
        # Start from the beginning of our collection of pages to visit:
        url = pagesToVisit[0]
        pagesToVisit = pagesToVisit[1:]
        try:
            print(numberVisited, "Visiting:", url)
            parser = LinkParser()
            data, links = parser.getLinks(url)
            if data.find(word) > -1:
                foundWord = True
            # Add the pages that we visited to the end of our collection
            # of pages to visit:
            pagesToVisit = pagesToVisit + links
            print(" **Success!**")
        except:
            print(" **Failed!**")
    if foundWord:
        print("The word", word, "was found at", url)
    else:
        print("Word never found")

This version makes full use of HTMLParser's features: LinkParser subclasses HTMLParser and overrides handle_starttag so that the href of every a tag is collected and joined with the base URL, while spider() walks the resulting link list until the word is found or the page limit is reached.
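A quick usage sketch (the start URL, search word and page limit below are placeholder values, not taken from the original post):

if __name__ == "__main__":
    # Crawl at most 10 pages starting from the site root, looking for the word "Python".
    spider("http://www.netinstructions.com/", "Python", 10)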

Reposted from: https://www.cnblogs.com/Tommy-Yu/p/6412277.html
