7/30/2023

Diffbot Crawlbot

Jobs can be created through Diffbot's GUI, but I find creating them via the Crawl API a more customizable experience. In an empty folder, let's first install the client library:

```
composer require swader/diffbot-php-client
```

I now need a `job.php` file into which I'll just dump the job creation procedure, as per the README:

```php
include 'vendor/autoload.php';

use Swader\Diffbot\Diffbot;

$diffbot = new Diffbot('my_token');
```

The Diffbot instance is used to create access points to the API types offered by Diffbot. For information on how to do this, see the README file. Note that with the individual APIs (like Product, Article, Discussion, etc.) you can process individual resources even with the free demo token, which lets you test out your links and see what data they'll return before diving into bulk processing via Crawlbot.

```php
$job = $diffbot->crawl('sp_search');
```

This will create a new crawljob when the `call()` method is called. First, we need to give it the seed URL(s) on which to start the spidering process:

```php
$job->setSeeds(); // seed URL(s) go here
```

Then, we make it notify us when it's done crawling, just so we know when a crawling round is complete and can expect up-to-date information to be in the dataset:

```php
$job->notify();
```

A site can have hundreds of thousands of links to spider, and hundreds of thousands of pages to process – the max limits are a cost-control mechanism, and in this case, I want the most detailed possible set available to me, so I'll put one million URLs into both values:

```php
$job->setMaxToCrawl(1000000)->setMaxToProcess(1000000);
```

We also want this job to refresh every 24 hours, because we know SitePoint publishes several new posts every single day. It's important to note that repeating means "from the time the last round has finished" – so if it takes a job 24 hours to finish, the new crawling round will actually start 48 hours from the start of the previous round. We'll set max rounds to 0, to indicate that we want this to repeat indefinitely:

```php
$job->setRepeat(1)->setMaxRounds(0);
```

Finally, there's the page processing pattern. When Diffbot processes pages during a crawl, only those that are processed – not crawled – are actually charged / counted towards your limit. It is, therefore, in our interest to be as specific as possible with our crawljob's definition, so as to avoid processing pages that aren't articles – like author bios, ads, or even category listings. Looking for a pattern that every post has should do. And of course, we want it to only process the pages it hasn't encountered before in each new round – no need to extract the same data over and over again; it would just stack up expenses:

```php
$job->setPageProcessPatterns(); // process pattern(s) go here
$job->setOnlyProcessIfNew(1);
```

Now we need to tell the job which API to use for processing. We could use the default – the Analyze API – which would make Diffbot auto-determine the structure of the data we're trying to obtain, but I prefer specificity and want it to know outright that it should only produce articles:

```php
$api = $diffbot->createArticleAPI('crawl')->setMeta(true)->setDiscussion(false);
$job->setApi($api);
```

Before finishing up with the crawljob configuration, there's just one more important parameter we need to add – the crawl pattern. When passing in a seed URL to the Crawl API, the crawljob will traverse all subdomains as well. So if we pass in the root domain, Crawlbot will also look through its subdomains, including outdated ones – this is something we want to avoid, as it would slow our crawling process dramatically and harvest stuff we don't need (we don't want the forums indexed right now). To set this up, we use the setUrlCrawlPatterns method, indicating that crawled links must start with the given prefix:

```php
$job->setUrlCrawlPatterns(); // crawl pattern(s) go here
```

The job is now configured, and we can `call()` Diffbot with instructions on how to create it:

```php
$job->call();
```

The full code for creating this job is:

```php
include 'vendor/autoload.php';

use Swader\Diffbot\Diffbot;

$diffbot = new Diffbot('my_token');
$job = $diffbot->crawl('sp_search');

$job
    ->setSeeds()
    ->notify()
    ->setMaxToCrawl(1000000)
    ->setMaxToProcess(1000000)
    ->setRepeat(1)
    ->setMaxRounds(0)
    ->setPageProcessPatterns()
    ->setOnlyProcessIfNew(1)
    ->setApi(
        $diffbot->createArticleAPI('crawl')->setMeta(true)->setDiscussion(false)
    )
    ->setUrlCrawlPatterns();

$job->call();
```

Calling this script via the command line (`php job.php`) or opening it in the browser creates the job – it can be seen in the Crawlbot dev screen.