Zhihu

From 集智百科 (Swarma Wiki)

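Raw dataset

The Zhihu data has three main levels: Question, Answer, and Comment.

[Unanswered questions] Data summary

Count: 594638 + 692737 = 1287375 questions

Time span: no information available

Extracted so far: tags

Planned next extraction: id, title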

Data content (sample record):

url : https://www.zhihu.com/question/37466655
title : 欧亨利的《财神和爱神》有什么深层解读?
source : 知乎
category : 知乎
content : {"question":{"title":"欧亨利的《财神和爱神》有什么深层解读?","detail":"欧亨利小说","tag":"|短篇小说|"},"answers":[]}
id : 8205998
md5 : a26a7c7cabb286a7018a89daf902db17
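
The tag extraction mentioned above can be sketched as follows. This is a minimal illustration, not the script that was actually used: it assumes the `content` field is a JSON string shaped like the sample record, and the helper names `parse_tags` and `tags_from_content` are hypothetical.

```python
import json


def parse_tags(tag_field):
    # Tags are wrapped in '|' characters, e.g. "|Mac||macOS|";
    # splitting on '|' leaves empty pieces, which are dropped here.
    return [t for t in tag_field.split('|') if t]


def tags_from_content(content_json):
    # content_json is the JSON string stored in a record's `content` field.
    content = json.loads(content_json)
    return parse_tags(content['question']['tag'])


# Shortened stand-in for a real `content` value (assumed shape).
sample_content = (
    '{"question":{"title":"t","detail":"d",'
    '"tag":"|Mac||macOS||OS X Lion|"},"answers":[]}'
)
print(tags_from_content(sample_content))  # ['Mac', 'macOS', 'OS X Lion']
```

Splitting on a single `|` and filtering out empty strings handles both the one-tag form (`|短篇小说|`) and the multi-tag form seen in the answered-question records.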

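[Answered questions] Data summary

Count: 2005564 questions; 2024834 answers. The maximum number of answers to a single question is 13479 [the table of answer counts has been produced].

Time span: estimated to run from the launch of Zhihu to March 2017 [the dataset is very large; the job is still running].

Extracted so far: tags, length of the answers of questions

Planned next extraction: answer timestamps, question timestamps

https://www.douban.com/note/670194577/

None of the ids here carry any meaning.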

Data content (sample record):

{'category': '知乎V2',
 'content': '{"url":"https://www.zhihu.com/question/19761165","title":"Mac OS X Lion 中不支持 Rosetta,已安装的 Jam Packs 是否不能正常运行?","question":{"title":"Mac OS X Lion 中不支持 Rosetta,已安装的 Jam Packs 是否不能正常运行?","detail":" 注:安装 Jam Packs 时需要从系统盘中安装 Rosetta。 ","tag":"|Mac||macOS||OS X Lion||音乐制作|"},"answers":[{"id":0,"content":"\\n 我已测试,Jam Packs 在 OS X Lion 中,Logic Pro 9.1.4 下可以正常运行。需要注意的是,在全新的 OS X Lion 中无法安装 Jam Packs,安装程序会不停弹出错误对话框。 \\n","created_at":"2011-07-23"},{"id":270769,"content":"\\n 没装过Jam Packs,但是Lion不支持Rosetta,WAR3不能跑。所以应该可以肯定Jam Packs也不行。 \\n","created_at":"2011-07-12"}]}',
 'created_at': 1486640448756,
 'id': 22199986,
 'md5': '3bcc3fe286f9ae449657766c0ffed0cd',
 'source': '知乎V2',
 'title': 'Mac OS X Lion 中不支持 Rosetta,已安装的 Jam Packs 是否不能正常运行?',
 'url': 'https://www.zhihu.com/question/19761165'}

# Extract question id, record-level created_at, tags, and answer dates
# from the raw files into one tab-separated file.
import json
import sys

CHUNK_SIZE = 10**8  # size hint (in bytes) for each readlines() call


def flush_print(s):
    # Overwrite the current console line with the progress counter.
    sys.stdout.write('\r')
    sys.stdout.write('%s' % s)
    sys.stdout.flush()


def clean_file(src_path, dst_path, mode):
    # Read src_path chunk by chunk and write one line per question:
    # qid <TAB> created_at <TAB> "tags" <TAB> "answer dates joined by ||"
    clock = 0
    with open(src_path, 'r') as bigfile, open(dst_path, mode) as f:
        chunk = bigfile.readlines(CHUNK_SIZE)
        while chunk:
            blocks = []
            clock += 1
            if clock % 10 == 0:
                flush_print(clock)
            for line in chunk:
                try:
                    js = json.loads(line)
                    jsc = json.loads(js['content'])
                    qid, qtime, tags = js['url'].split('/')[-1], js['created_at'], jsc['question']['tag']
                    ainf = [i['created_at'] for i in jsc['answers']]
                    line_str = '\t'.join([str(qid), str(qtime)]) + '\t"' + tags + '"\t' + '"' + '||'.join(ainf) + '"'
                    blocks.append(line_str)
                except Exception as e:
                    # Report malformed records and skip them.
                    print(e)
            for i in blocks:
                f.write(i + '\n')
            chunk = bigfile.readlines(CHUNK_SIZE)


# zhihuv2_2.txt is written first ('w'); zhihuv2_1.txt is then appended ('a')
# to the same output file.
clean_file(r'/Volumes/My Book/data/zhihu/zhihuv2_2.txt', r'/Volumes/My Book/data/zhihu/zhihuv2_1_clean.txt', 'w')
clean_file(r'/Volumes/My Book/data/zhihu/zhihuv2_1.txt', r'/Volumes/My Book/data/zhihu/zhihuv2_1_clean.txt', 'a')
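
Reading the cleaned file back might look like the sketch below. It assumes each line has the four tab-separated fields written by the script above, that tags contain no tab characters, and that the record-level `created_at` is a Unix timestamp in milliseconds (the sample value has 13 digits). The function name `parse_clean_line` and the sample line are illustrative only.

```python
from datetime import datetime, timezone


def parse_clean_line(line):
    # Fields: question id, record created_at, quoted tags, quoted answer dates.
    qid, qtime, tags, answer_dates = line.rstrip('\n').split('\t')
    # Drop the wrapping double quotes, then split on the separators.
    tag_list = [t for t in tags[1:-1].split('|') if t]
    date_list = [d for d in answer_dates[1:-1].split('||') if d]
    return {
        'qid': qid,
        # Assumption: created_at is epoch milliseconds.
        'created_at': datetime.fromtimestamp(int(qtime) / 1000, tz=timezone.utc),
        'tags': tag_list,
        'answer_dates': date_list,
        'n_answers': len(date_list),
    }


# A line built from the sample record shown earlier.
sample_line = '19761165\t1486640448756\t"|Mac||macOS||OS X Lion||音乐制作|"\t"2011-07-23||2011-07-12"\n'
record = parse_clean_line(sample_line)
print(record['qid'], record['n_answers'], record['tags'])
```

Counting the entries in the answer-date field gives the number of answers per question, which is one way to arrive at the answer-count table mentioned in the summary.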

[Comments] Data summary

Count: 4621085 comments

Time span: February 24 to March 7, 2017 (12 days)

Extracted so far: comment time, id

Data content (sample record):

{'class': '知乎评论',
 'content': '[{"liked":false,"own":false,"inReplyToCommentId":0,"featured":false,"href":"/r/answers/52428891/comments/233502562","reviewing":false,"disliked":false,"dislikesCount":0,"id":233502562,"author":{"isSelf":false,"bio":"it","meta":{"isAnswerAuthor":false,"isQuestionCreator":false},"name":"chenglin","isOrg":false,"url":"http://www.zhihu.com/people/chenglin-43","slug":"chenglin-43","avatar":{"id":"da8e974dc","template":"https://pic1.zhimg.com/{id}_{size}.jpg"}},"content":"知识营销指的是?","inReplyToUser":null,"createdTime":"2017-01-22T18:31:50+08:00","collapsed":false,"likesCount":0}]',
 'created_at': 1487920495203,
 'id': 31459999,
 'md5': '385d927abfd05bafafa3b86847031013',
 'source': '知乎',
 'title': '知乎评论 | 52428891',
 'url': 'https://www.zhihu.com/r/answers/52428891/comments'}
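
Extracting the comment id and time from such a record could be sketched as below. This assumes the `content` field is a JSON array of comment objects, each with `id` and an ISO 8601 `createdTime`, as in the sample; the function name `comment_ids_and_times` is hypothetical and the sample string is a shortened stand-in.

```python
import json
from datetime import datetime


def comment_ids_and_times(content_json):
    # content_json is the JSON array stored in a comment record's `content` field.
    comments = json.loads(content_json)
    # createdTime looks like "2017-01-22T18:31:50+08:00"; fromisoformat
    # (Python 3.7+) keeps the UTC offset.
    return [(c['id'], datetime.fromisoformat(c['createdTime'])) for c in comments]


# Shortened stand-in for a real `content` value (assumed shape).
sample_content = (
    '[{"id":233502562,"inReplyToCommentId":0,'
    '"createdTime":"2017-01-22T18:31:50+08:00","likesCount":0}]'
)
for comment_id, created in comment_ids_and_times(sample_content):
    print(comment_id, created.isoformat())
```

Note that in the sample the comment's own `createdTime` (January 2017) is earlier than the record-level `created_at`, so the two fields should not be treated as interchangeable.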