[Mac] Running Morphological Analysis on YouTube Comments with MeCab in Python

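The YouTube Data API lets you pull comments, so this post shows how to extract YouTube comments and then runs morphological analysis on them with MeCab in Python to see which words show up in a comment section.

For now I will only scratch the surface, working in a local Mac environment.

  1. Extracting comment data with the YouTube Data API
  2. Processing YouTube video comments with MeCab

If you have not yet obtained an API key or installed the libraries, see the article below.

1. Extracting comment data with the YouTube Data API

Let's start by extracting the comment data. There are two comment-related resources, CommentThreads and Comments, and both return their data as JSON.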

CommentThreads

CommentThreads takes a video ID or a channel ID and returns the comments tied to that ID.

For example, fetching 5 comments for the video ID "fdsaZ8EMR2U" looks like this.

# -*- coding: utf-8 -*-

# Sample Python code for youtube.commentThreads.list
# See instructions for running these code samples locally:
# https://developers.google.com/explorer-help/guides/code_samples#python

import os

import googleapiclient.discovery

def main():
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

    api_service_name = "youtube"
    api_version = "v3"
    DEVELOPER_KEY = "YOUR_API_KEY"

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey = DEVELOPER_KEY)

    request = youtube.commentThreads().list(
        part="id,replies,snippet",
        maxResults=5,
        videoId="fdsaZ8EMR2U"
    )
    response = request.execute()

    print(response)

if __name__ == "__main__":
    main()

A JSON response like the following comes back. It contains the comment text, the author's channel name, and so on. Replies to a comment are included under "replies".

{
  "kind": "youtube#commentThreadListResponse",
  "etag": "tSw5WSiFS4IMMytcgoYXJ9zpu6I",
  "nextPageToken": "QURTSl9pMDFKVnZtTFl0NFZOdnhaZFpXaFBOcWU5aDA0QWM5bDVpYk5oVTd1WDQwSDY1cU11OVBOZHNnWFNOTmNJby1Db1JpWno2Qnd5bw==",
  "pageInfo": {
    "totalResults": 5,
    "resultsPerPage": 5
  },
  "items": [
    {
      "kind": "youtube#commentThread",
      "etag": "An3zz04lgE7jUVO7VXXmqfdwFjk",
      "id": "UgwDY44NUll4uiZXqqx4AaABAg",
      "snippet": {
        "videoId": "fdsaZ8EMR2U",
        "topLevelComment": {
          "kind": "youtube#comment",
          "etag": "Kty1w2F4dTbXmakl-ywdK28vLEg",
          "id": "UgwDY44NUll4uiZXqqx4AaABAg",
          "snippet": {
            "videoId": "fdsaZ8EMR2U",
            "textDisplay": "SO gooood",
            "textOriginal": "SO gooood",
            "authorDisplayName": "fatt musiek",
            "authorProfileImageUrl": "https://yt3.ggpht.com/ytc/AAUvwngpQ-20jVq0c-9aC-wDJ87aTKi2QvPLTRN2GXGRaw=s48-c-k-c0x00ffffff-no-rj",
            "authorChannelUrl": "http://www.youtube.com/channel/UCl3ha3zwY9p6CemIZZXIdXQ",
            "authorChannelId": {
              "value": "UCl3ha3zwY9p6CemIZZXIdXQ"
            },
            "canRate": true,
            "viewerRating": "none",
            "likeCount": 0,
            "publishedAt": "2021-05-22T18:48:34Z",
            "updatedAt": "2021-05-22T18:48:34Z"
          }
        },
        "canReply": true,
        "totalReplyCount": 0,
        "isPublic": true
      }
    },
    {
      "kind": "youtube#commentThread",
      "etag": "QVJH5RHTNij1fN5jRj_mNcDscHA",
      "id": "Ugx8sUuwqKqPVG9eSuJ4AaABAg",
      "snippet": {
        "videoId": "fdsaZ8EMR2U",
        "topLevelComment": {
          "kind": "youtube#comment",
          "etag": "Y_SsBrGGxQoLpcztQqND9wGarUc",
          "id": "Ugx8sUuwqKqPVG9eSuJ4AaABAg",
          "snippet": {
            "videoId": "fdsaZ8EMR2U",
            "textDisplay": "so what if really yuffie have met johnny hehe",
            "textOriginal": "so what if really yuffie have met johnny hehe",
            "authorDisplayName": "GregOrio Barachina",
            "authorProfileImageUrl": "https://yt3.ggpht.com/ytc/AAUvwnjgJE6zBYksYQWt8TmKlMDYOyG0t-BHPNWWmvUUPQ=s48-c-k-c0x00ffffff-no-rj",
            "authorChannelUrl": "http://www.youtube.com/channel/UCUs2OJ4-KqYGS2EPJCDj7tQ",
            "authorChannelId": {
              "value": "UCUs2OJ4-KqYGS2EPJCDj7tQ"
            },
            "canRate": true,
            "viewerRating": "none",
            "likeCount": 0,
            "publishedAt": "2021-05-21T13:42:45Z",
            "updatedAt": "2021-05-21T13:42:45Z"
          }
        },
        "canReply": true,
        "totalReplyCount": 0,
        "isPublic": true
      }
    },
    {
      "kind": "youtube#commentThread",
      "etag": "ggLtp9jtNyrqb3JSvzkvUDon7gg",
      "id": "UgwP-4ucsrWh_iXJQMN4AaABAg",
      "snippet": {
        "videoId": "fdsaZ8EMR2U",
        "topLevelComment": {
          "kind": "youtube#comment",
          "etag": "Raxyf3_Zw3ksZyGeWDt7_HHW8SA",
          "id": "UgwP-4ucsrWh_iXJQMN4AaABAg",
          "snippet": {
            "videoId": "fdsaZ8EMR2U",
            "textDisplay": "The Aerith and Cloud scene is much more meaningful than the one with Tifa. Still a good scene but come on...Aerith just appearing amongst the flowers and getting to see her again...priceless",
            "textOriginal": "The Aerith and Cloud scene is much more meaningful than the one with Tifa. Still a good scene but come on...Aerith just appearing amongst the flowers and getting to see her again...priceless",
            "authorDisplayName": "Maxx Doran",
            "authorProfileImageUrl": "https://yt3.ggpht.com/ytc/AAUvwnixfMDBxLt_TfUEjlpHhU-OvwE1vjCgpFBAVIMxjg=s48-c-k-c0x00ffffff-no-rj",
            "authorChannelUrl": "http://www.youtube.com/channel/UCXcLTX_9fNHLVAMr_plxeqQ",
            "authorChannelId": {
              "value": "UCXcLTX_9fNHLVAMr_plxeqQ"
            },
            "canRate": true,
            "viewerRating": "none",
            "likeCount": 0,
            "publishedAt": "2021-05-18T01:50:04Z",
            "updatedAt": "2021-05-18T01:50:04Z"
          }
        },
        "canReply": true,
        "totalReplyCount": 0,
        "isPublic": true
      }
    },
    {
      "kind": "youtube#commentThread",
      "etag": "ptUVfOGBkZDnUXKFoDaGnQ9Y-gw",
      "id": "UgzTN_4ek7syWNNbCrB4AaABAg",
      "snippet": {
        "videoId": "fdsaZ8EMR2U",
        "topLevelComment": {
          "kind": "youtube#comment",
          "etag": "L3vDroBKOAklgIqKSkcX_JvLn_g",
          "id": "UgzTN_4ek7syWNNbCrB4AaABAg",
          "snippet": {
            "videoId": "fdsaZ8EMR2U",
            "textDisplay": "You caught on to the magnify barrier idea so early. I was part way into hard mode before I thought of that.",
            "textOriginal": "You caught on to the magnify barrier idea so early. I was part way into hard mode before I thought of that.",
            "authorDisplayName": "Justin Edwards",
            "authorProfileImageUrl": "https://yt3.ggpht.com/ytc/AAUvwngz1mU5zD3QHSRVU3jXTEZApnkYsmAzCKFXxUyD1w=s48-c-k-c0x00ffffff-no-rj",
            "authorChannelUrl": "http://www.youtube.com/channel/UCO-oPQJCpNw87M6YbcuuFMw",
            "authorChannelId": {
              "value": "UCO-oPQJCpNw87M6YbcuuFMw"
            },
            "canRate": true,
            "viewerRating": "none",
            "likeCount": 0,
            "publishedAt": "2021-05-10T07:25:36Z",
            "updatedAt": "2021-05-10T07:25:36Z"
          }
        },
        "canReply": true,
        "totalReplyCount": 0,
        "isPublic": true
      }
    },
    {
      "kind": "youtube#commentThread",
      "etag": "vfaqu09YbjpC_akz6riq0_XpSCw",
      "id": "UgygOOysmSAraKnx81h4AaABAg",
      "snippet": {
        "videoId": "fdsaZ8EMR2U",
        "topLevelComment": {
          "kind": "youtube#comment",
          "etag": "k8vIS0anrGkqCcFfyj0gnrUwXQI",
          "id": "UgygOOysmSAraKnx81h4AaABAg",
          "snippet": {
            "videoId": "fdsaZ8EMR2U",
            "textDisplay": "Y'know Max, you COULD have just run 5k steps in Aerith's garden, checked what the Materia did, and moved on. Or, maybe, look up an online guide, since by now I'm sure SOMEONE has posted one.",
            "textOriginal": "Y'know Max, you COULD have just run 5k steps in Aerith's garden, checked what the Materia did, and moved on. Or, maybe, look up an online guide, since by now I'm sure SOMEONE has posted one.",
            "authorDisplayName": "Soma Cruz the Demigod of Balance",
            "authorProfileImageUrl": "https://yt3.ggpht.com/ytc/AAUvwnilg2dkOBvJqeTbW34CBoxURHLWv78fnbCRkArv=s48-c-k-c0x00ffffff-no-rj",
            "authorChannelUrl": "http://www.youtube.com/channel/UCNaiemmWvNbzfaQm5e3hyqA",
            "authorChannelId": {
              "value": "UCNaiemmWvNbzfaQm5e3hyqA"
            },
            "canRate": true,
            "viewerRating": "none",
            "likeCount": 0,
            "publishedAt": "2021-05-07T02:21:29Z",
            "updatedAt": "2021-05-07T02:21:29Z"
          }
        },
        "canReply": true,
        "totalReplyCount": 1,
        "isPublic": true
      },
      "replies": {
        "comments": [
          {
            "kind": "youtube#comment",
            "etag": "oMSJ1drDJreTmguX72NWydzfbcY",
            "id": "UgygOOysmSAraKnx81h4AaABAg.9N10GKEX1209N2h_vQGc9Y",
            "snippet": {
              "videoId": "fdsaZ8EMR2U",
              "textDisplay": "\u003ca href=\"https://www.youtube.com/watch?v=fdsaZ8EMR2U&t=38m03s\"\u003e38:03\u003c/a\u003e These things don't stagger? But each time they clone, they lose health, and the clones are much weaker. Damn, I see why you had trouble with these, Max.",
              "textOriginal": "38:03 These things don't stagger? But each time they clone, they lose health, and the clones are much weaker. Damn, I see why you had trouble with these, Max.",
              "parentId": "UgygOOysmSAraKnx81h4AaABAg",
              "authorDisplayName": "Soma Cruz the Demigod of Balance",
              "authorProfileImageUrl": "https://yt3.ggpht.com/ytc/AAUvwnilg2dkOBvJqeTbW34CBoxURHLWv78fnbCRkArv=s48-c-k-c0x00ffffff-no-rj",
              "authorChannelUrl": "http://www.youtube.com/channel/UCNaiemmWvNbzfaQm5e3hyqA",
              "authorChannelId": {
                "value": "UCNaiemmWvNbzfaQm5e3hyqA"
              },
              "canRate": true,
              "viewerRating": "none",
              "likeCount": 0,
              "publishedAt": "2021-05-07T18:08:01Z",
              "updatedAt": "2021-05-07T18:08:01Z"
            }
          }
        ]
      }
    }
  ]
}
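
The nesting in this response can be hard to follow at a glance. As a minimal sketch, the helper below pulls textOriginal out of each thread and its replies. The name extract_texts is hypothetical (it is not part of the API client), and sample_response is a trimmed-down stand-in for a real response, not actual API output.

```python
def extract_texts(response):
    """Collect textOriginal from each top-level comment and its replies."""
    texts = []
    for item in response.get("items", []):
        top = item["snippet"]["topLevelComment"]["snippet"]
        texts.append(top["textOriginal"])
        # "replies" only appears on threads that came back with replies
        for reply in item.get("replies", {}).get("comments", []):
            texts.append(reply["snippet"]["textOriginal"])
    return texts


# Trimmed-down stand-in for a commentThreads.list response (hypothetical data)
sample_response = {
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {"textOriginal": "SO gooood"}}}},
        {
            "snippet": {"topLevelComment": {"snippet": {"textOriginal": "first"}}},
            "replies": {"comments": [{"snippet": {"textOriginal": "a reply"}}]},
        },
    ]
}

print(extract_texts(sample_response))
```

Using item.get("replies", {}) avoids a KeyError on threads without replies, which is the common case in the response above.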

Comments

Comments fetches data by specifying comment IDs directly. Here the five comment IDs obtained in the example above are passed as "id".

# -*- coding: utf-8 -*-

# Sample Python code for youtube.comments.list
# See instructions for running these code samples locally:
# https://developers.google.com/explorer-help/guides/code_samples#python

import os

import googleapiclient.discovery

def main():
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

    api_service_name = "youtube"
    api_version = "v3"
    DEVELOPER_KEY = "YOUR_API_KEY"

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey = DEVELOPER_KEY)

    request = youtube.comments().list(
        part="id,snippet",
        id="UgwDY44NUll4uiZXqqx4AaABAg,Ugx8sUuwqKqPVG9eSuJ4AaABAg,UgwP-4ucsrWh_iXJQMN4AaABAg,UgzTN_4ek7syWNNbCrB4AaABAg,UgygOOysmSAraKnx81h4AaABAg"
    )
    response = request.execute()

    print(response)

if __name__ == "__main__":
    main()

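A JSON response like the following comes back. The CommentThreads response included the ID of the video each comment was posted on as well as replies to the comment, but the Comments response includes neither.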

{
  "kind": "youtube#commentListResponse",
  "etag": "vwaB3KAa_Snb_GuTkkMYrlL7Jrg",
  "items": [
    {
      "kind": "youtube#comment",
      "etag": "E9ovRZPTGOUQzHb0AEiKA26EJxY",
      "id": "UgwDY44NUll4uiZXqqx4AaABAg",
      "snippet": {
        "textDisplay": "SO gooood",
        "textOriginal": "SO gooood",
        "authorDisplayName": "fatt musiek",
        "authorProfileImageUrl": "https://yt3.ggpht.com/ytc/AAUvwngpQ-20jVq0c-9aC-wDJ87aTKi2QvPLTRN2GXGRaw=s48-c-k-c0x00ffffff-no-rj",
        "authorChannelUrl": "http://www.youtube.com/channel/UCl3ha3zwY9p6CemIZZXIdXQ",
        "authorChannelId": {
          "value": "UCl3ha3zwY9p6CemIZZXIdXQ"
        },
        "canRate": true,
        "viewerRating": "none",
        "likeCount": 0,
        "publishedAt": "2021-05-22T18:48:34Z",
        "updatedAt": "2021-05-22T18:48:34Z"
      }
    },
    {
      "kind": "youtube#comment",
      "etag": "FG5oWvmMF39kDNl_rlnzb0bsSWM",
      "id": "Ugx8sUuwqKqPVG9eSuJ4AaABAg",
      "snippet": {
        "textDisplay": "so what if really yuffie have met johnny hehe",
        "textOriginal": "so what if really yuffie have met johnny hehe",
        "authorDisplayName": "GregOrio Barachina",
        "authorProfileImageUrl": "https://yt3.ggpht.com/ytc/AAUvwnjgJE6zBYksYQWt8TmKlMDYOyG0t-BHPNWWmvUUPQ=s48-c-k-c0x00ffffff-no-rj",
        "authorChannelUrl": "http://www.youtube.com/channel/UCUs2OJ4-KqYGS2EPJCDj7tQ",
        "authorChannelId": {
          "value": "UCUs2OJ4-KqYGS2EPJCDj7tQ"
        },
        "canRate": true,
        "viewerRating": "none",
        "likeCount": 0,
        "publishedAt": "2021-05-21T13:42:45Z",
        "updatedAt": "2021-05-21T13:42:45Z"
      }
    },
    {
      "kind": "youtube#comment",
      "etag": "WQw2UyILXAhkYOAl-AKScZi1pCY",
      "id": "UgwP-4ucsrWh_iXJQMN4AaABAg",
      "snippet": {
        "textDisplay": "The Aerith and Cloud scene is much more meaningful than the one with Tifa. Still a good scene but come on...Aerith just appearing amongst the flowers and getting to see her again...priceless",
        "textOriginal": "The Aerith and Cloud scene is much more meaningful than the one with Tifa. Still a good scene but come on...Aerith just appearing amongst the flowers and getting to see her again...priceless",
        "authorDisplayName": "Maxx Doran",
        "authorProfileImageUrl": "https://yt3.ggpht.com/ytc/AAUvwnixfMDBxLt_TfUEjlpHhU-OvwE1vjCgpFBAVIMxjg=s48-c-k-c0x00ffffff-no-rj",
        "authorChannelUrl": "http://www.youtube.com/channel/UCXcLTX_9fNHLVAMr_plxeqQ",
        "authorChannelId": {
          "value": "UCXcLTX_9fNHLVAMr_plxeqQ"
        },
        "canRate": true,
        "viewerRating": "none",
        "likeCount": 0,
        "publishedAt": "2021-05-18T01:50:04Z",
        "updatedAt": "2021-05-18T01:50:04Z"
      }
    },
    {
      "kind": "youtube#comment",
      "etag": "Vchl7kutnRgZb-uKYjMNMrJQ2qQ",
      "id": "UgzTN_4ek7syWNNbCrB4AaABAg",
      "snippet": {
        "textDisplay": "You caught on to the magnify barrier idea so early. I was part way into hard mode before I thought of that.",
        "textOriginal": "You caught on to the magnify barrier idea so early. I was part way into hard mode before I thought of that.",
        "authorDisplayName": "Justin Edwards",
        "authorProfileImageUrl": "https://yt3.ggpht.com/ytc/AAUvwngz1mU5zD3QHSRVU3jXTEZApnkYsmAzCKFXxUyD1w=s48-c-k-c0x00ffffff-no-rj",
        "authorChannelUrl": "http://www.youtube.com/channel/UCO-oPQJCpNw87M6YbcuuFMw",
        "authorChannelId": {
          "value": "UCO-oPQJCpNw87M6YbcuuFMw"
        },
        "canRate": true,
        "viewerRating": "none",
        "likeCount": 0,
        "publishedAt": "2021-05-10T07:25:36Z",
        "updatedAt": "2021-05-10T07:25:36Z"
      }
    },
    {
      "kind": "youtube#comment",
      "etag": "pbfhdpIgB5QCKnr4Inkm_U2wbjQ",
      "id": "UgygOOysmSAraKnx81h4AaABAg",
      "snippet": {
        "textDisplay": "Y'know Max, you COULD have just run 5k steps in Aerith's garden, checked what the Materia did, and moved on. Or, maybe, look up an online guide, since by now I'm sure SOMEONE has posted one.",
        "textOriginal": "Y'know Max, you COULD have just run 5k steps in Aerith's garden, checked what the Materia did, and moved on. Or, maybe, look up an online guide, since by now I'm sure SOMEONE has posted one.",
        "authorDisplayName": "Soma Cruz the Demigod of Balance",
        "authorProfileImageUrl": "https://yt3.ggpht.com/ytc/AAUvwnilg2dkOBvJqeTbW34CBoxURHLWv78fnbCRkArv=s48-c-k-c0x00ffffff-no-rj",
        "authorChannelUrl": "http://www.youtube.com/channel/UCNaiemmWvNbzfaQm5e3hyqA",
        "authorChannelId": {
          "value": "UCNaiemmWvNbzfaQm5e3hyqA"
        },
        "canRate": true,
        "viewerRating": "none",
        "likeCount": 0,
        "publishedAt": "2021-05-07T02:21:29Z",
        "updatedAt": "2021-05-07T02:21:29Z"
      }
    }
  ]
}
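
The id parameter is a comma-separated string, as in the request above. Rather than pasting IDs by hand, that string could be built from a commentThreads response. This is only a sketch: join_thread_ids is a hypothetical helper name, and sample_threads merely mimics the shape of a real response.

```python
def join_thread_ids(thread_response):
    """Build a comma-separated string of top-level comment IDs."""
    ids = [
        item["snippet"]["topLevelComment"]["id"]
        for item in thread_response.get("items", [])
    ]
    return ",".join(ids)


# Stand-in with the same nesting as a commentThreads.list response
sample_threads = {
    "items": [
        {"snippet": {"topLevelComment": {"id": "UgwDY44NUll4uiZXqqx4AaABAg"}}},
        {"snippet": {"topLevelComment": {"id": "Ugx8sUuwqKqPVG9eSuJ4AaABAg"}}},
    ]
}

id_param = join_thread_ids(sample_threads)
print(id_param)
```

The resulting string can then be passed as the id argument of youtube.comments().list, in the same way as the hard-coded value in the sample.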

2. Processing YouTube video comments with MeCab

Next, let's run morphological analysis on the extracted comments.

First, run the following in your virtual environment to get MeCab ready.

$ brew install mecab
$ brew install mecab-ipadic
$ pip install mecab-python3

Basics: fetching a single comment and running morphological analysis

Once MeCab is set up, run code like the following.

import os
import googleapiclient.discovery
import MeCab

def main():
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

    api_service_name = "youtube"
    api_version = "v3"
    DEVELOPER_KEY = "YOUR_API_KEY"

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey = DEVELOPER_KEY)

    # Fetch a comment by specifying its comment ID
    request = youtube.comments().list(
        part="id,snippet",
        id="UgzKZhtc3fX6JgX6p5p4AaABAg"  # comment ID
    )
    response = request.execute()

    # Take the comment text from the returned JSON and store it in text
    text = response['items'][0]['snippet']['textOriginal']

    m = MeCab.Tagger()

    # Walk the node list and collect the surface form of each token
    node = m.parseToNode(text)
    text_after = []
    while node:
        text_after.append(node.surface)
        node = node.next

    print('Before: ' + str(text))  # text before MeCab processing
    print('After: ' + str(text_after))  # text after MeCab processing

if __name__ == "__main__":
    main()

Running the code above on a comment from one of Eiko Kano's (狩野英孝) videos returns the following result.

Before: ここ最近英孝ちゃんの動画みてたらいつの間にか日付け変わってるんだけど笑
After: ['', 'ここ', '最近', '英孝', 'ちゃん', 'の', '動画', 'み', 'て', 'たら', 'いつの間にか', '日', '付け', '変わっ', 'てる', 'ん', 'だ', 'けど', '笑', '']
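
The empty strings at both ends of the list come from the sentence-boundary (BOS/EOS) nodes that parseToNode includes in its linked list. A minimal sketch of skipping them follows. So that it runs without MeCab installed, it walks stand-in nodes built with SimpleNamespace. collect_surfaces and make_chain are hypothetical names; with a real tagger you would pass the result of m.parseToNode(text) instead.

```python
from types import SimpleNamespace


def collect_surfaces(node):
    """Walk a linked list of nodes and keep only non-empty surfaces."""
    surfaces = []
    while node:
        if node.surface:  # BOS/EOS nodes have an empty surface
            surfaces.append(node.surface)
        node = node.next
    return surfaces


def make_chain(surfaces):
    """Build stand-in nodes exposing .surface and .next like MeCab nodes."""
    head = None
    for surface in reversed(surfaces):
        head = SimpleNamespace(surface=surface, next=head)
    return head


# Stand-in for the node chain of a short sentence, with empty ends
chain = make_chain(["", "ここ", "最近", "動画", ""])
print(collect_surfaces(chain))
```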

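Advanced: fetching multiple comments and surfacing frequent words

Next, fetch comment threads with the YouTube Data API, run them through MeCab, and extract the words that appear most often.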

import os
import googleapiclient.discovery
import MeCab
import collections

def main():
    # Disable OAuthlib's HTTPS verification when running locally.
    # *DO NOT* leave this option enabled in production.
    os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

    api_service_name = "youtube"
    api_version = "v3"
    DEVELOPER_KEY = "YOUR_API_KEY"

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey = DEVELOPER_KEY)

    # Fetch comment threads
    request = youtube.commentThreads().list(
        part="id,replies,snippet",
        maxResults=100,  # maximum number of comment threads to fetch
        videoId="jsRR_ZimvAo",  # video ID
        order="relevance"  # fetch comment threads in order of relevance
    )
    response = request.execute()

    # Collect top-level comments and any replies attached to them
    comment_list = []
    for item in response['items']:
        comment_list.append(item['snippet']['topLevelComment']['snippet']['textOriginal'])
        if 'replies' in item:
            for reply in item['replies']['comments']:
                comment_list.append(reply['snippet']['textOriginal'])

    # MeCab processing
    m = MeCab.Tagger()  # create the tagger once and reuse it for every comment
    comment_particles = []
    for comment in comment_list:
        node = m.parseToNode(comment)
        while node:
            if len(node.surface) > 0:  # skip empty surfaces ('')
                hinshi = node.feature.split(',')[0]  # part of speech
                if hinshi in ['名詞', '形容詞']:  # keep only nouns and adjectives
                    comment_particles.append(node.surface)

            node = node.next

    c = collections.Counter(comment_particles)

    # Print in descending order of frequency
    for i in c.most_common(30):
        print(i)


if __name__ == "__main__":
    main()

The video features the comedy duos さまぁ〜ず and ぺこぱ, so the tuples that come back look like this.

('シュウ', 23)
('ペイ', 23)
('さん', 21)
('笑', 17)
('好き', 11)
('ちゃん', 10)
('シュウペイ', 9)
('の', 9)
('ん', 9)
('ぺこぱ', 9)
('企画', 9)
('お前', 9)
('ツッコミ', 8)
('~', 8)
('寺', 7)
('かわいい', 7)
('最高', 7)
('松陰', 6)
('ショック', 6)
('回', 6)
('純粋', 6)
('人', 6)
('俺', 6)
('いい', 5)
('www', 5)
('w', 5)
('面白い', 5)
('ww', 5)
('❤', 5)
('ずさん', 4)

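Words getting split at unnatural points is a concern (シュウペイ, for example, mostly comes out as the separate tokens シュウ and ペイ), but for now this shows the kind of thing you can do!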
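
One request returns at most 100 threads (the upper limit of maxResults), so a longer comment section needs paging with the nextPageToken field seen in the commentThreads response earlier. The following is a sketch under assumptions: fetch_all_items and fetch_page are hypothetical names, and fake_pages is made-up data so the loop can run offline. With the real client, fetch_page would issue commentThreads().list with pageToken set to the given token and call execute().

```python
def fetch_all_items(fetch_page, max_pages=10):
    """Follow nextPageToken until it runs out or max_pages is reached."""
    items = []
    token = None
    for _ in range(max_pages):
        response = fetch_page(token)
        items.extend(response.get("items", []))
        token = response.get("nextPageToken")
        if not token:  # no further pages
            break
    return items


# Hypothetical two-page result keyed by page token, for offline testing
fake_pages = {
    None: {"items": ["thread-1", "thread-2"], "nextPageToken": "page2"},
    "page2": {"items": ["thread-3"]},
}

all_items = fetch_all_items(lambda token: fake_pages[token])
print(all_items)
```

The max_pages cap is a safety limit, since every page is a separate API request that counts against the quota.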
