爬虫实例 | 网络暴民面前，潘长江也得认识蔡徐坤！ | ATYUN.COM 官网-人工智能教程资讯全方位服务平台

爬虫实例 | 网络暴民面前，潘长江也得认识蔡徐坤！

2019年04月02日由 yining 发表 91924 0

网络暴民，键盘喷子，网络时代的黑白无常

本文作者：小F

文章来源公众号：法纳斯特（walker398）

原标题：网络暴力有多可怕？

本文已经原作者授权发布，内容稍有改动，转载文章请联系原作者。

这个事件起因是因为潘长江老师不认识蔡徐坤，在综艺节目上没认出来蔡徐坤本人，随后不到一个小时的时间，潘长江老师的微博评论区就被各种网络喷子和水军占领，随后潘长江老师不得不发微博澄清这件事情，同时蔡徐坤也在微博下方评论，希望“不要被别有用心的人得逞”。

不得不说微博的键盘侠、喷子、黑粉是真的无法无天，经常强行带节奏。比如自己一直关注的英雄联盟，再上周王校长也是被带了一波节奏，事情源于姿态退役后又复出的一条微博。

本来是一句很普通的调侃回复，「离辣个传奇adc的回归，还远吗？[二哈]」。然后就有人开始带王校长的节奏，直接把王校长给惹毛了。

上面这些事情，对于我这个吃瓜群众，也没什么好说的。只是希望以后能没有那么多无聊的人去带节奏，强行给他人带来压力。本次通过获取潘长江老师那条微博的评论用户信息，来分析一波。一共是获取了3天的评论，共14万条，这个数量真的可怕。

/ 01 / 前期工作

微博评论信息获取就不细说，之前也讲过了。这里提一下用户信息获取，同样从移动端下手。

主要是获取用户的昵称、性别、地区、微博数、关注数、粉丝数。另外本次的数据存储采用MySQL数据库。

创建数据库。

import pymysql



db = pymysql.connect(host='127.0.0.1', user='root', password='774110919', port=3306)

cursor = db.cursor()

cursor.execute("CREATE DATABASE weibo DEFAULT CHARACTER SET utf8mb4")

db.close()

创建表格以及设置字段信息。

import pymysql



db = pymysql.connect(host='127.0.0.1', user='root', password='774110919', port=3306, db='weibo')

cursor = db.cursor()

sql = 'CREATE TABLE IF NOT EXISTS comments (user_id VARCHAR(255) NOT NULL, user_message VARCHAR(255) NOT NULL, weibo_message VARCHAR(255) NOT NULL, comment VARCHAR(255) NOT NULL, praise VARCHAR(255) NOT NULL, date VARCHAR(255) NOT NULL, PRIMARY KEY (comment, date))'

cursor.execute(sql)

db.close()

/ 02 / 数据获取

具体代码如下。

from copyheaders import headers_raw_to_dict

from bs4 import BeautifulSoup

import requests

import pymysql

import re



headers = b"""

accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8

accept-encoding:gzip, deflate, br

accept-language:zh-CN,zh;q=0.9

cache-control:max-age=0

cookie:你的参数

upgrade-insecure-requests:1

user-agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36

"""



# 将请求头字符串转化为字典

headers = headers_raw_to_dict(headers)





def to_mysql(data):

    """

    信息写入mysql

    """

    table = 'comments'

    keys = ', '.join(data.keys())

    values = ', '.join(['%s'] * len(data))

    db = pymysql.connect(host='localhost', user='root', password='774110919', port=3306, db='weibo')

    cursor = db.cursor()

    sql = 'INSERT INTO {table}({keys}) VALUES ({values})'.format(table=table, keys=keys, values=values)

    try:

        if cursor.execute(sql, tuple(data.values())):

            print("Successful")

            db.commit()

    except:

        print('Failed')

        db.rollback()

    db.close()





def get_user(user_id):

    """

    获取用户信息

    """

    try:

        url_user = 'https://weibo.cn' + str(user_id)

        response_user = requests.get(url=url_user, headers=headers)

        soup_user = BeautifulSoup(response_user.text, 'html.parser')

        # 用户信息

        re_1 = soup_user.find_all(class_='ut')

        user_message = re_1[0].find(class_='ctt').get_text()

        # 微博信息

        re_2 = soup_user.find_all(class_='tip2')

        weibo_message = re_2[0].get_text()

        return (user_message, weibo_message)

    except:

        return ('未知', '未知')





def get_message():

    # 第一页有热门评论,拿取信息较麻烦,这里偷个懒~

    for i in range(2, 20000):

        data = {}

        print('第------------' + str(i) + '------------页')

        # 请求网址

        url = 'https://weibo.cn/comment/Hl2O21Xw1?uid=1732460543&rl=0&page=' + str(i)

        response = requests.get(url=url, headers=headers)

        html = response.text

        soup = BeautifulSoup(html, 'html.parser')

        # 评论信息

        comments = soup.find_all(class_='ctt')

        # 点赞数

        praises = soup.find_all(class_='cc')

        # 评论时间

        date = soup.find_all(class_='ct')

        # 获取用户名

        name = re.findall('id="C_.*?href="/.*?">(.*?)', html)

        # 获取用户ID

        user_ids = re.findall('id="C_.*?href="(.*?)">(.*?)', html)



        for j in range(len(name)):

            # 用户ID

            user_id = user_ids[j][0]

            (user_message, weibo_message) = get_user(user_id)

            data['user_id'] = " ".join(user_id.split())

            data['user_message'] = " ".join(user_message.split())

            data['weibo_message'] = " ".join(weibo_message.split())

            data['comment'] = " ".join(comments[j].get_text().split())

            data['praise'] = " ".join(praises[j * 2].get_text().split())

            data['date'] = " ".join(date[j].get_text().split())

            print(data)

            # 写入数据库中

            to_mysql(data)





if __name__ == '__main__':

    get_message()

最后成功获取评论信息。

3天14万条评论，着实可怕。

有时我不禁在想，到底是谁天天会那么无聊去刷评论。

职业黑粉，职业水军吗？

好像还真的有。

/ 03 / 数据清洗

清洗代码如下。

import pandas as pd

import pymysql



# 设置列名与数据对齐

pd.set_option('display.unicode.ambiguous_as_wide', True)

pd.set_option('display.unicode.east_asian_width', True)

# 显示10列

pd.set_option('display.max_columns', 10)

# 显示10行

pd.set_option('display.max_rows', 10)

# 设置显示宽度为500,这样就不会在IDE中换行了

pd.set_option('display.width', 2000)



# 读取数据

conn = pymysql.connect(host='localhost', user='root', password='774110919', port=3306, db='weibo', charset='utf8mb4')

cursor = conn.cursor()

sql = "select * from comments"

db = pd.read_sql(sql, conn)



# 清洗数据

df = db['user_message'].str.split(' ', expand=True)

# 用户名

df['name'] = df[0]

# 性别及地区

df1 = df[1].str.split('/', expand=True)

df['gender'] = df1[0]

df['province'] = df1[1]

# 用户ID

df['id'] = db['user_id']

# 评论信息

df['comment'] = db['comment']

# 点赞数

df['praise'] = db['praise'].str.extract('(\d+)').astype("int")

# 微博数,关注数,粉丝数

df2 = db['weibo_message'].str.split(' ', expand=True)

df2 = df2[df2[0] != '未知']

df['tweeting'] = df2[0].str.extract('(\d+)').astype("int")

df['follows'] = df2[1].str.extract('(\d+)').astype("int")

df['followers'] = df2[2].str.extract('(\d+)').astype("int")

# 评论时间

df['time'] = db['date'].str.split(':', expand=True)[0]

df['time'] = pd.Series([i+'时' for i in df['time']])

df['day'] = df['time'].str.split(' ', expand=True)[0]

# 去除无用信息

df = df.ix[:, 3:]

df = df[df['name'] != '未知']

df = df[df['time'].str.contains("日")]

# 随机输出10行数据

print(df.sample(10))

输出数据。

随机输出十条，就大致能看出评论区是什么画风了。

/ 04 / 数据可视化

01 评论用户性别情况

通过用户ID对数据去重后，剩下约10万+用户。

第一张图为所有用户的性别情况，其中男性3万+，女性7万+。

这确实也符合蔡徐坤的粉丝群体。

第二张图是因为之前看到「Alfred数据室」对于蔡徐坤粉丝群体的分析。

提到了很多蔡徐坤的粉丝喜欢用带有「坤、蔡、葵、kun」的昵称。

所以将昵称包含这些字的用户提取出来。

果不其然，女性1.2万+，男性900+，更加符合了蔡徐坤的粉丝群体。

可视化代码如下。

from pyecharts import Pie, Map, Line





def create_gender(df):

    # 全部用户

    # df = df.drop_duplicates('id')

    # 包含关键字用户

    df = df[df['name'].str.contains("坤|蔡|葵|kun")].drop_duplicates('id')

    # 分组汇总

    gender_message = df.groupby(['gender'])

    gender_com = gender_message['gender'].agg(['count'])

    gender_com.reset_index(inplace=True)



    # 生成饼图

    attr = gender_com['gender']

    v1 = gender_com['count']

    # pie = Pie("微博评论用户的性别情况", title_pos='center', title_top=0)

    # pie.add("", attr, v1, radius=[40, 75], label_text_color=None, is_label_show=True, legend_orient="vertical", legend_pos="left", legend_top="%10")

    # pie.render("微博评论用户的性别情况.html")

    pie = Pie("微博评论用户的性别情况(昵称包含关键字)", title_pos='center', title_top=0)

    pie.add("", attr, v1, radius=[40, 75], label_text_color=None, is_label_show=True, legend_orient="vertical", legend_pos="left", legend_top="%10")

    pie.render("微博评论用户的性别情况(昵称包含关键字).html")

02 评论用户区域分布

广东以8000+的评论用户居于首位，随后则是北京、山东，江苏，浙江，四川。

这里也与之前网易云音乐评论用户的分布有点相似。

更加能说明这几个地方的网民不少。

可视化代码如下。

def create_map(df):

    # 全部用户

    df = df.drop_duplicates('id')

    # 分组汇总

    loc_message = df.groupby(['province'])

    loc_com = loc_message['province'].agg(['count'])

    loc_com.reset_index(inplace=True)



    # 绘制地图

    value = [i for i in loc_com['count']]

    attr = [i for i in loc_com['province']]

    map = Map("微博评论用户的地区分布图", title_pos='center', title_top=0)

    map.add("", attr, value, maptype="china", is_visualmap=True, visual_text_color="#000", is_map_symbol_show=False, visual_range=[0, 7000])

    map.render('微博评论用户的地区分布图.html')

03 评论用户关注数分布

整体上符合常态，不过我也很好奇那些关注上千的用户，是什么样的一个存在。

可视化代码如下。

def create_follows(df):

    """

    生成评论用户关注数情况

    """

    df = df.drop_duplicates('id')

    follows = df['follows']

    bins = [0, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000]

    level = ['0-10', '10-20', '20-50', '50-100', '100-200', '200-500', '500-1000', '1000-2000', '2000-5000', '5000-10000', '10000以上']

    len_stage = pd.cut(follows, bins=bins, labels=level).value_counts().sort_index()

    # 生成柱状图

    attr = len_stage.index

    v1 = len_stage.values

    bar = Bar("评论用户关注数分布情况", title_pos='center', title_top='18', width=800, height=400)

    bar.add("", attr, v1, is_stack=True, is_label_show=True, xaxis_interval=0, xaxis_rotate=30)

    bar.render("评论用户关注数分布情况.html")

04 评论用户粉丝数分布

这里发现粉丝数为「0-10」的用户不少，估摸着应该是水军在作怪了。

粉丝数为「50-100」的用户最多。

可视化代码如下。

def create_follows(df):

    """

    生成评论用户关注数情况

    """

    df = df.drop_duplicates('id')

    follows = df['follows']

    bins = [0, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000]

    level = ['0-10', '10-20', '20-50', '50-100', '100-200', '200-500', '500-1000', '1000-2000', '2000-5000', '5000-10000', '10000以上']

    len_stage = pd.cut(follows, bins=bins, labels=level).value_counts().sort_index()

    # 生成柱状图

    attr = len_stage.index

    v1 = len_stage.values

    bar = Bar("评论用户关注数分布情况", title_pos='center', title_top='18', width=800, height=400)

    bar.add("", attr, v1, is_stack=True, is_label_show=True, xaxis_interval=0, xaxis_rotate=30)

    bar.render("评论用户关注数分布情况.html")

05 评论时间分布

潘老师是在17时发出微博的，但是那时并没有大量的评论出现，那个小时一共有1237条评论。

直到蔡徐坤在18时评论后，微博的评论一下就上去了，24752条。

而且目前一半的评论都是在蔡徐坤的回复底下评论，点赞数多的也大多都在其中。

不得不说蔡徐坤的粉丝力量真大，可怕可怕~

可视化代码如下。

def creat_date(df):

    # 分组汇总

    date_message = df.groupby(['time'])

    date_com = date_message['time'].agg(['count'])

    date_com.reset_index(inplace=True)



    # 绘制走势图

    attr = date_com['time']

    v1 = date_com['count']

    line = Line("微博评论的时间分布", title_pos='center', title_top='18', width=800, height=400)

    line.add("", attr, v1, is_smooth=True, is_fill=True, area_color="#000", xaxis_interval=24, is_xaxislabel_align=True, xaxis_min="dataMin", area_opacity=0.3, mark_point=["max"], mark_point_symbol="pin", mark_point_symbolsize=55)

    line.render("微博评论的时间分布.html")

06 评论词云

大体上言论还算好，没有很偏激。

可视化代码如下。

from wordcloud import WordCloud, ImageColorGenerator

import matplotlib.pyplot as plt

import jieba





def create_wordcloud(df):

    """

    生成评论词云

    """

    words = pd.read_csv('chineseStopWords.txt', encoding='gbk', sep='\t', names=['stopword'])

    # 分词

    text = ''

    for line in df['comment']:

        line = line.split(':')[-1]

        text += ' '.join(jieba.cut(str(line), cut_all=False))

    # 停用词

    stopwords = set('')

    stopwords.update(words['stopword'])

    backgroud_Image = plt.imread('article.jpg')

    wc = WordCloud(

        background_color='white',

        mask=backgroud_Image,

        font_path='C:\Windows\Fonts\华康俪金黑W8.TTF',

        max_words=2000,

        max_font_size=150,

        min_font_size=15,

        prefer_horizontal=1,

        random_state=50,

        stopwords=stopwords

    )

    wc.generate_from_text(text)

    img_colors = ImageColorGenerator(backgroud_Image)

    wc.recolor(color_func=img_colors)

    # 高词频词语

    process_word = WordCloud.process_text(wc, text)

    sort = sorted(process_word.items(), key=lambda e: e[1], reverse=True)

    print(sort[:50])

    plt.imshow(wc)

    plt.axis('off')

    wc.to_file("微博评论词云.jpg")

    print('生成词云成功!')

/ 04 / 总结

最后，照例来扒一扒哪位用户评论最多。

这位男性用户，一共评论了90条，居于首位。

评论画风有点迷，是来搅局的吗？

这位女性用户，一共评论了80条。

大部分内容都是围绕黑粉去说的。

这位女性用户，一共评论了71条。

疯狂与评论区互动...

这位男性用户，一共评论了68条。

也在与评论区互动，不过大多数评论情感倾向都是偏消极的。

观察了评论数最多的10名用户，发现其中男性用户的评论都是偏负面的，女性评论都是正面的。

好了，作为一名吃瓜群众，我是看看就好，也就不发表什么言论了。

下面这段内容摘自公众号「老马小刀」，个人感觉有理。

标签：

Python 学习网络安全网络暴力

0 评论

欢迎关注ATYUN官方公众号

商务合作及内容投稿请联系邮箱:bd@atyun.com

上一篇白话机器学习算法 Part 1

下一篇白话机器学习算法 Part 2

评论登录

要发表评论，您必须先登录。

jonatasgrosman/wav2vec2-large-xlsr-53-english facebook/dino-vitb16 bert-base-uncased xlm-roberta-large xlm-roberta-base gpt2 microsoft/resnet-50 facebook/dino-vits8

最好的基于Transformer的LLM（上）