Saturday, 15 December 2018

Trouble fetching some title from a webpage

I've written a script in PHP to scrape a title, shown on the page as hair fall shamboo, from a webpage. When I run the script below, I get the following error:

Notice: Trying to get property 'nodeValue' of non-object in C:\xampp\htdocs\runcode\testfile.php on line 16.

Link to that site

The script I've tried:

<?php
    function get_content($url){
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        // Run the request once and capture the response body
        $htmlContent = curl_exec($ch);
        curl_close($ch);
        return $htmlContent;
    }
    $link = "https://www.purplle.com/search?q=hair%20fall%20shamboo"; 
    $xml = get_content($link);
    $dom = @DOMDocument::loadHTML($xml);
    $xpath = new DOMXPath($dom);
    $title = $xpath->query('//h1[@class="br-hdng"]/span')->item(0)->nodeValue;
    echo "{$title}";
?>

My expected output is:

hair fall shamboo

The XPath I used in the script above seems correct to me. Here is the relevant portion of the HTML in which the title can be found:

<h1 _ngcontent-c0="" class="br-hdng"><span _ngcontent-c0="" class="pr dib">hair fall shamboo<!----></span></h1>

P.S. The title I wish to parse is loaded dynamically. As I'm new to PHP, I don't know whether my approach is correct. If it isn't, what should I do instead?
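To illustrate why dynamic loading matters here, the following is a minimal sketch and not a diagnosis of the actual site. It assumes the server's first response is only an empty application shell that JavaScript fills in later. The `RAW_HTML` string is made up for illustration (it was not captured from the site), and `TitleFinder` and `find_title` are hypothetical helper names. `RENDERED_HTML` is the fragment quoted above. The same lookup finds nothing in the un-rendered markup and finds the title in the rendered markup.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collects the first text found inside <h1 class="br-hdng"><span>...</span></h1>."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.in_span = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" and "br-hdng" in classes:
            self.in_h1 = True
        elif tag == "span" and self.in_h1:
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False
        elif tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        # Keep only the first non-empty text node inside the target span.
        if self.in_span and self.title is None and data.strip():
            self.title = data.strip()


def find_title(html):
    """Return the title text, or None when the element is absent."""
    finder = TitleFinder()
    finder.feed(html)
    finder.close()
    return finder.title


# ASSUMPTION: a made-up example of an un-rendered response (empty app shell).
RAW_HTML = '<html><body><app-root></app-root><script src="main.js"></script></body></html>'

# The rendered fragment quoted in the question.
RENDERED_HTML = (
    '<h1 _ngcontent-c0="" class="br-hdng">'
    '<span _ngcontent-c0="" class="pr dib">hair fall shamboo<!----></span></h1>'
)

print(find_title(RAW_HTML))       # None
print(find_title(RENDERED_HTML))  # hair fall shamboo
```

If the raw response really does look like the first case, a lookup that returns nothing would be consistent with the 'nodeValue' of non-object notice, since item(0) would have no node to return.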

The following are scripts I've written in two other languages; both work as expected.

I got it working with JavaScript:

const puppeteer = require('puppeteer');
function run () {
    return new Promise(async (resolve, reject) => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://www.purplle.com/search?q=hair%20fall%20shamboo");
            // Read the title text from the rendered page
            let urls = await page.evaluate(() => {
                let items = document.querySelector('h1.br-hdng span');
                return items.innerText;
            });
            await browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    })
}
run().then(console.log).catch(console.error);

It also works with Python:

import requests_html

with requests_html.HTMLSession() as session:
    r = session.get('https://www.purplle.com/search?q=hair%20fall%20shamboo')
    r.html.render()
    item = r.html.find("h1.br-hdng span",first=True).text
    print(item)

What's wrong with the PHP version, then?


