How do I select what sites to scrape?

If you are using the article downloader or article creator, it may be unclear why there is a Google search option and also a custom sources option that lists Google again.

You may be wondering:

If I do not select a custom source (e.g. Ezine) and only choose a Google region (e.g. Google USA), will SCM only scrape Google USA?
Why does it scrape Google + Google blog when only Google is selected?

How SCM looks for content

The program will only scrape content from sources that you enable.

In fact, you can select no sources at all and the article creator task will still run.
This is because some people want to load in their own content.

If you are scraping content you have 2 choices:

1) Google search

All Google searches return results from both search and blogs (about 150 links). This is why Google blog results appear even when only Google is selected.

2) Custom sources

Under custom sources you again have a choice of search engines:
2a) Bing/BingCache/Google/Custom Search Engine.

Custom sources let you specify particular sites or your own search engines to find content.
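
The selection behaviour described above can be modelled roughly as follows. This is only a minimal Python sketch of the behaviour as documented here, not SCM's actual code; the function name `resolve_sources` and the label strings are hypothetical.

```python
from typing import List, Optional


def resolve_sources(google_region: Optional[str], custom_sources: List[str]) -> List[str]:
    """Return the sources a task would scrape (illustrative model only)."""
    sources: List[str] = []
    if google_region:
        # A Google search returns results from both search and blogs.
        sources.append(f"{google_region} (search)")
        sources.append(f"{google_region} (blogs)")
    # Custom sources are scraped only if they were explicitly enabled.
    sources.extend(custom_sources)
    return sources


# Only a Google region selected: search and blog results, nothing else.
print(resolve_sources("Google USA", []))
# Nothing selected: no scraping, the task would run on your own content.
print(resolve_sources(None, []))
```

In this model, choosing only Google USA never pulls in Ezine or any other custom source, and an empty selection is still a valid task.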

How to verify what is happening

If you look in the application log, you can verify which sites are being scraped and how many links the scraper finds on each.

Application log
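
If you copy the found URLs out of the log, a short script can tally them per site. The sketch below assumes you already have the URLs as a plain Python list; the log's actual format is not covered here, and the function name `count_links_by_site` and the example URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def count_links_by_site(urls):
    """Count how many URLs belong to each host name."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host:  # skip entries that are not full URLs
            counts[host] += 1
    return counts


# Hypothetical URLs standing in for links copied from the log.
found = [
    "https://example.com/article-1",
    "https://example.com/article-2",
    "https://blog.example.org/post",
]
for site, total in count_links_by_site(found).most_common():
    print(site, total)
```

A site you did not enable showing up in the tally would be a sign that the task is using more sources than you intended.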


2 thoughts on “How do I select what sites to scrape?”

  1. Mike says:

    What if I want to use a different search engine than Google altogether? Say Dogpile, DuckDuckGo, Bing, Yandex, etc.

    It seems to crash (the task stays ‘waiting’ and the scheduler says ‘paused’) until I un-select these alternatives.

    1. admin says:

      You should be able to set them up as a custom search engine. The “waiting” means there is a task ahead of it in the scheduler waiting to run.

      It could also be that the status of the task is wrong and needs to be manually cleared.

      If you can supply me with the setup details for Yandex etc., I can test it for you.
