Wednesday, 18 July 2018

Determining chapter number in different types of text

I'm pulling titles from novel-related posts. The aim is to use a regex to determine which chapter(s) each post is about. Each site identifies chapters in a different way. Here are the most common cases, with the expected result in the trailing comment:

$title = 'text chapter 25.6 text'; // c25.6
$title = 'text chapters 23, 24, 25 text'; // c23-25
$title = 'text chapters 23+24+25 text'; // c23-25
$title = 'text chapter 23, 25 text'; // c23 & 25
$title = 'text chapter 23 & 24 & 25 text'; // c23-25
$title = 'text c25.5-30 text'; // c25.5-30
$title = 'text c99-c102 text'; // c99-102
$title = 'text chapter 99 - chapter 102 text'; // c99-102
$title = 'text chapter 1 - 3 text'; // c1-3
$title = '33 text chapter 1, 2 text 3'; // c1-2
$title = 'text v2c5-10 text'; // c5-10
$title = 'text chapters 23, 24, 25, 29, 31, 32 text'; // c23-25 & 29 & 31-32

The chapter numbers are always present in the title, just in the different variations shown above.
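To make the expected output format concrete, here is a minimal sketch of how a list of already-extracted chapter numbers could be collapsed into the compact notation used in the comments above. The function name collapse_chapters() is hypothetical, and the sketch assumes that only consecutive whole numbers should be merged into a range. It does not cover ranges written explicitly in the title, such as c25.5-30.

```php
<?php
// Hypothetical helper (not part of the original code): turns a list of
// chapter numbers into a string such as "c23-25 & 29 & 31-32".
// Assumption: only runs of consecutive whole numbers become a range.
function collapse_chapters(array $numbers)
{
    // Normalise to floats, drop duplicates and sort ascending.
    $numbers = array_unique(array_map('floatval', $numbers));
    sort($numbers);

    $parts = array();
    $count = count($numbers);

    for ($i = 0; $i < $count; $i++) {
        $start = $end = $numbers[$i];

        // Extend the run while the current end is a whole number and
        // the next value is exactly one higher.
        while ($i + 1 < $count
            && $end == floor($end)
            && $numbers[$i + 1] == $end + 1) {
            $end = $numbers[++$i];
        }

        $parts[] = ($start == $end) ? (string) $start : $start . '-' . $end;
    }

    return $parts ? 'c' . implode(' & ', $parts) : '';
}

echo collapse_chapters(array(23, 24, 25, 29, 31, 32)); // c23-25 & 29 & 31-32
```

Keeping the formatting step separate from the extraction step means the regex only has to find the numbers, not decide how to display them.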

What I have so far

So far, I have a regex that handles titles containing a single chapter number, like:

$title = '9 text chapter 25.6 text'; // c25.6

Using this code (try ideone):

function get_chapter($text, $terms) {

    // Bail out early on missing input; false matches the "no match" result below.
    if (empty($text)) return false;
    if (empty($terms) || !is_array($terms)) return false;

    $values = false;

    // Escape each term so it is treated literally inside the regex.
    $terms_quoted = array();
    foreach ($terms as $term) {
        $terms_quoted[] = preg_quote($term, '/');
    }

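    // Search $text for one of the terms followed by a number.
    // The match is case-insensitive (/i), allows optional whitespace between
    // the term and the number, and accepts decimals such as 25.6.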
    if (preg_match('/('.implode('|', $terms_quoted).')\s*(\d+(\.\d+)?)/i', $text, $matches)) {
        if (!empty($matches[2]) && is_numeric($matches[2])) {
            $values = array(
                'term' => $matches[1],
                'value' => $matches[2]
            );
        }
    }

    return $values;
}

$text = '9 text chapter 25.6 text'; // c25.6
$terms = array('chapter', 'chapters');
$chapter = get_chapter($text, $terms);

print_r($chapter);

if ($chapter) {
    echo 'Chapter is: c'. $chapter['value'];
}

How do I make this work with the other examples listed above? Given the complexity of this question, I will bounty it 200 points when eligible.
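One possible direction, offered only as a rough sketch: match the term followed by a whole list of numbers, then split that list into single chapters and explicit ranges. The function name get_chapter_list() is hypothetical, and the sketch assumes that the separators are limited to `,`, `+`, `&` and `-`, that a hyphen always means a range, and that 'c' is passed as a term to cover the short forms. It has only been reasoned through against the sample titles above, so other formats may well need adjustments.

```php
<?php
// Hypothetical sketch (not a definitive solution): returns the chapters
// found after the first matching term, e.g. array('23', '24', '25') or
// array('99-102'). Returns an empty array when nothing matches.
function get_chapter_list($text, array $terms)
{
    if (empty($text) || empty($terms)) {
        return array();
    }

    // Escape each term so it is treated literally inside the regex.
    $quoted = array();
    foreach ($terms as $t) {
        $quoted[] = preg_quote($t, '/');
    }
    $term = '(?:' . implode('|', $quoted) . ')';
    $num  = '\d+(?:\.\d+)?';

    // A term not preceded by a letter, then a number, then any repetition
    // of: separator, optional repeated term, number.
    $list = '/(?<![a-z])' . $term . '\s*(' . $num
          . '(?:\s*[,+&-]\s*(?:' . $term . '\s*)?' . $num . ')*)/i';
    if (!preg_match($list, $text, $m)) {
        return array();
    }

    // Within the captured list, a number optionally followed by
    // "- [term] number" is treated as a range.
    $item = '/(' . $num . ')(?:\s*-\s*(?:' . $term . '\s*)?(' . $num . '))?/i';
    preg_match_all($item, $m[1], $items, PREG_SET_ORDER);

    $chapters = array();
    foreach ($items as $it) {
        $is_range   = isset($it[2]) && $it[2] !== '';
        $chapters[] = $is_range ? $it[1] . '-' . $it[2] : $it[1];
    }

    return $chapters;
}

$terms = array('chapter', 'chapters', 'c');
print_r(get_chapter_list('text chapter 99 - chapter 102 text', $terms));
```

The lookbehind `(?<![a-z])` is there so that the short term 'c' can still match in 'v2c5-10' without matching the last letter of an ordinary word.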


