Botmation Documentation

Scrapers

These BotActions scrape the Page's document by running code within the Page's context.

Scrapers use an HTML parser to convert the outerHTML property of one or more HTML Elements into an interactive object.

The default HTML parser is cheerio's load() function.

You can override the HTML parser by providing a function that operates on the outerHTML, either via the optional higher-order param higherOrderHTMLParser or as the optional first inject. If both are provided, the higher-order param overrides the injected one, so you can inject one parser for a group of scrapers and still give an individual scraper its own.
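The precedence described above can be sketched in isolation. This is a minimal illustration, not the library's code: `resolveParser` and the three stand-in parsers are hypothetical names, and the stand-ins replace cheerio so the snippet runs on its own.

```typescript
// A parser is any function that receives outerHTML and returns something usable
type HTMLParser = (html: string) => unknown

// Hypothetical helper (not part of Botmation): picks the parser to use.
// Order: higher-order param first, then the injected parser, then the default.
const resolveParser = (
  defaultParser: HTMLParser,
  injectedHTMLParser?: HTMLParser,
  higherOrderHTMLParser?: HTMLParser
): HTMLParser => higherOrderHTMLParser ?? injectedHTMLParser ?? defaultParser

// Stand-in parsers (assumptions) that tag their output with where they came from
const defaultParser: HTMLParser = html => ({ source: 'default', html })
const injectedParser: HTMLParser = html => ({ source: 'injected', html })
const higherOrderParser: HTMLParser = html => ({ source: 'higher-order', html })

// With all three available, the higher-order parser is the one selected
const chosen = resolveParser(defaultParser, injectedParser, higherOrderParser)
```

In the real scrapers the default slot would be cheerio's load() and the injected slot would be filled by the first inject.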

HTML Parser

This higher-order BotAction injects the HTML parser function provided in the first call, htmlParser(), into the BotActions assembled in the second call, htmlParser()().

It gives BotActions that implement the ScraperBotAction interface a way to override the default HTML parser.

Assembled BotActions can still override the injected HTML parser by providing their own via their optional higher-order param.

const htmlParser = (htmlParser: Function) =>
  (...actions: BotAction[]): BotAction =>
    pipe()(
      inject(htmlParser)(
        errors('htmlParser()()')(...actions)
      )
    )

For a usage example, see the LinkedIn example.
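To make "a function to operate on the outerHTML" concrete, here is a minimal sketch of a custom parser. `textOnlyParser` is a hypothetical name, and the regex is only a crude, dependency-free stand-in for a real HTML parser; it is not how cheerio works.

```typescript
// Hypothetical custom parser: strips tags and returns the remaining text.
// Accepts undefined because a selector may match nothing.
const textOnlyParser = (outerHTML?: string): string =>
  (outerHTML ?? '')
    .replace(/<[^>]*>/g, ' ') // replace every tag with a space
    .replace(/\s+/g, ' ')     // collapse runs of whitespace
    .trim()

const badgeText = textOnlyParser('<span class="badge"> 2 </span>')
const postText = textOnlyParser('<li><b>Ann</b> posted</li>')
```

Going by the signatures shown in this page, a function like this could be supplied either as htmlParser(textOnlyParser)(...actions) or as the second argument of a single scraper.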

Scrape Single

This ScraperBotAction will parse the HTML of the first Element in the document that matches the provided selector.

const $ = <R = CheerioStatic>(htmlSelector: string, higherOrderHTMLParser?: Function): ScraperBotAction<R> =>
  async(page, injectedHTMLParser) => {
    let parser: Function

    if (!higherOrderHTMLParser) {
      if (injectedHTMLParser) {
        parser = injectedHTMLParser
      } else {
        parser = cheerio.load
      }
    } else {
      parser = higherOrderHTMLParser
    }

    const scrapedHTML = await page.evaluate(getElementOuterHTML, htmlSelector)
    return parser(scrapedHTML)
  }

Example:

await pipe()(
  goToDashboard, // fake -> load site with "dashboard"
  // look at the header for the badge UI representing # of notifications
  $('header .user .notifications .badge'),
  // by default, it returns a CheerioStatic instance
  map(notificationsCount => notificationsCount.text()), // grab the text from the "element"
  log('# of Notifications') // Pipe: 2
)(page)

Scrape Multiple

This ScraperBotAction will parse the HTML of all Elements in the document that match the provided selector.

const $$ = <R = CheerioStatic[]>(htmlSelector: string, higherOrderHTMLParser?: Function): ScraperBotAction<R> =>
  async(page, ...injects) => {
    let parser: Function

    // Future support piping the HTML selector with higher-order overriding
    const [,injectedHTMLParser] = unpipeInjects(injects, 1)

    if (higherOrderHTMLParser) {
      parser = higherOrderHTMLParser
    } else {
      if (injectedHTMLParser) {
        parser = injectedHTMLParser
      } else {
        parser = cheerio.load
      }
    }

    const scrapedHTMLs = await page.evaluate(getElementsOuterHTML, htmlSelector)
    const cheerioEls: CheerioStatic[] = scrapedHTMLs.map(scrapedHTML => parser(scrapedHTML))
    return cheerioEls as any as R
  }

Example:

await pipe()(
  goToNewsFeed, // fake -> load site with "news feed"
  // selector to grab each "element" from the Feed container
  // Each element represents a Post in the Feed with children
  // elements representing data like Author, Date, Actual Text, etc
  $$('section.app .feed [data-id]'),
  // by default, it returns a CheerioStatic[]
  forAll()( // loop the CheerioStatic[]
    feedPostEl => ([
      // scroll and "like" the post IF the post belongs to a friend
      // in the `friends` array (ie by name matching post authors)
      givenThat(postBelongsTo(feedPostEl, ...friends))(
        scrollTo(feedPostEl.attr('id')),
        like(feedPostEl)
      )
    ])
  )
)(page)

Evaluate

This BotAction isn't a Scraper, but it is an important building block for scraping. It wraps the Puppeteer Page's evaluate() method to execute JavaScript in the Page. It takes a function and that function's parameters, and returns whatever the function returns.

const evaluate = (functionToEvaluate: EvaluateFn<any>, ...functionParams: any[]): BotAction<any> =>
  async(page) => await page.evaluate(functionToEvaluate, ...functionParams)

For an example, see the Navigation BotAction scrollTo().
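The shape of evaluate() can also be tried without a browser. The sketch below is an assumption-laden stand-in: `PageLike` and `fakePage` are hypothetical and simply call the function locally, whereas a real Puppeteer Page serializes the function and runs it inside the browser.

```typescript
// Hypothetical minimal stand-in for the one Page method this sketch needs
interface PageLike {
  evaluate(fn: (...args: any[]) => any, ...args: any[]): Promise<any>
}

// Same shape as the evaluate() BotAction above, typed against PageLike
const evaluate = (functionToEvaluate: (...args: any[]) => any, ...functionParams: any[]) =>
  async (page: PageLike) => page.evaluate(functionToEvaluate, ...functionParams)

// Fake page (assumption): runs the function in Node instead of in a browser
const fakePage: PageLike = {
  evaluate: async (fn, ...args) => fn(...args)
}

// The function receives the parameters passed after it and its result is returned
const addInPage = evaluate((a: number, b: number) => a + b, 2, 3)
const pendingSum: Promise<any> = addInPage(fakePage)
```

With a real Page, the function cannot rely on variables from the Node.js scope, which is why values are passed as parameters, and both parameters and return value need to be serializable.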

Helpers

To "scrape" the HTML of the page's document, these Helper functions are evaluated in the page's context. To avoid race conditions with the document's nodes, each returns the outerHTML property (a string representation of the HTML element and its children) to the Node.js context, where it is then parsed with an HTML parser.

getElementOuterHTML()

const getElementOuterHTML = (htmlSelector: string): string|undefined =>
  document.querySelector(htmlSelector)?.outerHTML
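Because the return type is string|undefined, a parser may receive undefined when the selector matches nothing. One possible way to guard a custom parser is sketched below; `withFallback` and `tagName` are hypothetical names, and `tagName` is a stand-in parser used only so the snippet runs without cheerio.

```typescript
// Hypothetical wrapper: returns `fallback` when no element matched,
// otherwise delegates to the wrapped parser
const withFallback = <R>(parser: (outerHTML: string) => R, fallback: R) =>
  (outerHTML?: string): R =>
    outerHTML === undefined ? fallback : parser(outerHTML)

// Stand-in parser (assumption): reads the tag name of the outermost element
const tagName = (outerHTML: string): string =>
  /^<([a-z0-9-]+)/i.exec(outerHTML.trim())?.[1] ?? ''

const safeTagName = withFallback(tagName, 'no match')

const matched = safeTagName('<div class="a"></div>')
const unmatched = safeTagName(undefined)
```

A wrapped parser like this could then be passed wherever a custom HTML parser is accepted.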

getElementsOuterHTML()

const getElementsOuterHTML = (htmlSelector: string): string[] =>
  Array.from(document.querySelectorAll(htmlSelector)).map(el => el.outerHTML)