Tuesday, 23 March 2021

Web Scraping in JavaScript

I am trying to scrape a webpage in JavaScript which looks as follows:

[image: the page being scraped]

The code shown is part of a larger loop that goes through each repo and scrapes its contents. I've confirmed that I can capture the first element of every repo item on the page (so the javascript of "33-js-concepts", the react of "playground", the react of "react-google-static", etc.) and can scrape all the items in the first repo (so javascript, concepts, nodejs, react, angular, etc.), but I keep getting an error on subsequent loops. Here is my code:

r.topic = []; // topics used in the repo
var topics = $('.topics-row-container > a', parent);
if (topics && topics.length > 0) {
  for (var i in topics) {
    r.topic.push(topics[i].children[0].data.replace(/^\s+|\s+$/g, ''));
  }
}
console.log(r.topic);

The first loop produces the expected result, with console.log(r.topic) printing:

[
  'javascript',
  'concepts',
  'nodejs',
  'react',
  'angular',
  'programming',
  'javascript-programming'
]

But subsequent loops produce the following error:

r.topic.push(topics[i].children[0].data.replace(/^\s+|\s+$/g, ''));
                                       ^
TypeError: Cannot read property '0' of undefined

I'm new to JavaScript, so I suspect I'm missing something obvious, but I can't understand why children would throw this error. I even tried incrementing the children index by one with each loop, but I still saw the same error.

I would really appreciate any help!

UPDATE: topics printed to the console looks as follows:

children: [ [Node] ],
    parent: Node {
      type: 'tag',
      name: 'div',
      namespace: 'http://www.w3.org/1999/xhtml',
      attribs: [Object: null prototype],
      'x-attribsNamespace': [Object: null prototype],
      'x-attribsPrefix': [Object: null prototype],
      children: [Array],
      parent: [Node],
      prev: [Node],
      next: [Node]
    },
    prev: Node {
      type: 'text',
      data: '\n          ',
      parent: [Node],
      prev: [Node],
      next: [Circular *7]
    },
    next: Node {
      type: 'text',
      data: '\n      ',
      parent: [Node],
      prev: [Circular *7],
      next: null
    }
  },
  options: { xml: false, decodeEntities: true },
  _root: <ref *8> initialize {
    '0': Node {
      type: 'root',
      name: 'root',
      parent: null,
      prev: null,
      next: null,
      children: [Array],
      'x-mode': 'no-quirks'
    },
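
One plausible cause, given that the dump shows non-numeric keys such as options and _root sitting next to the matched nodes: for...in visits every enumerable key of the object, not only the numeric indices, so topics[i] can be something that has no children array. The sketch below does not use the real scraping library; it builds a hypothetical stand-in object (makeTextLink and the topics literal are illustrative assumptions, not taken from the original code) to show the effect, and then an index-based loop with a guard as one possible way around it.

```javascript
// Hypothetical helper: mimics an <a> node whose only child is a text node.
function makeTextLink(text) {
  return { type: 'tag', name: 'a', children: [{ type: 'text', data: text }] };
}

// Hypothetical stand-in for the selection object: numeric indices plus extra
// enumerable properties (the exact shape is an assumption based on the dump).
const topics = {
  0: makeTextLink('\n    javascript\n  '),
  1: makeTextLink('\n    react\n  '),
  length: 2,
  options: { xml: false, decodeEntities: true },
};

// for...in walks every enumerable key, including 'length' and 'options'.
// topics['length'] is a number and topics['options'] is a plain object, so
// neither has a `children` array, and reading children[0] on them throws.
const visitedKeys = [];
for (const key in topics) {
  visitedKeys.push(key);
}
console.log(visitedKeys);

// Index-based loop: only touches 0..length-1 and skips nodes without text.
function collectTopics(selection) {
  const result = [];
  for (let i = 0; i < selection.length; i++) {
    const node = selection[i];
    const first = node && node.children && node.children[0];
    if (first && typeof first.data === 'string') {
      result.push(first.data.trim()); // trim() replaces the whitespace regex
    }
  }
  return result;
}

console.log(collectTopics(topics));
```

If $ here is cheerio (the dump suggests it, but the post does not say so), topics.each(function (i, el) { r.topic.push($(el).text().trim()); }) would be another option, since .each() only iterates the matched elements.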

