A very powerful feature of Cloud Preservation is its ability to collect external links. External links are links to web pages or documents that are outside of the website or social media feed being collected.
For website feeds, Cloud Preservation determines whether a link is external by comparing the address of the link to the addresses defined in your feed. For social media feeds (such as Twitter or Facebook), an external link is any link found in a post from that feed.
Cloud Preservation provides four configurable options for managing external links. These options let you tailor your feeds to your collection needs and give you a level of control over your feed’s storage use.
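To make the website-feed comparison concrete, here is a minimal Python sketch. It is not Cloud Preservation's actual code: the function name is_external, the feed_addresses parameter, and the rule that a link is internal when its host matches the host of any feed address are all assumptions made for illustration. The real crawler may compare addresses differently (for example, by path as well as host).

```python
from urllib.parse import urlsplit


def is_external(link: str, feed_addresses: list[str]) -> bool:
    """Return True when `link` matches none of the feed's addresses.

    Assumption: two addresses "match" when their host names are equal.
    This is a simplification chosen for the sketch, not a documented rule.
    """
    link_host = urlsplit(link).hostname or ""
    for address in feed_addresses:
        feed_host = urlsplit(address).hostname or ""
        if link_host == feed_host:
            return False  # same host as a feed address: treat as internal
    return True  # no feed address matched: treat as external


# Hypothetical feed with a single address.
feed = ["https://www.example.com/"]
print(is_external("https://www.example.com/about", feed))
print(is_external("https://news.example.org/story", feed))
```

Under this host-only rule, www.example.com and example.com count as different sites, so a real feed would list every address it considers its own.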
Option 1: Never collect external links
This option allows you to ignore offsite links entirely. When the Cloud Preservation crawler encounters a link that it determines to be external, it will record that link, but will not collect the web page at that link’s address. Since this option leaves these external pages out of your repository completely, these external links have no impact on your feed’s storage use.
When to use this option: There is no requirement to collect external pages, and/or there isn’t enough storage capacity for external pages in the Cloud Preservation repository for the selected plan.
Option 2: Never collect modified versions of external links
With this option selected, Cloud Preservation checks whether it has collected an external link before by comparing the link’s address to the addresses of all pages it has collected in the past. If it finds a page in the repository with the same address, Cloud Preservation simply links that existing page to the currently running crawl instead of collecting it again. Of all the options that collect external links, this one has the lowest impact on repository storage.
When to use this option: There is a requirement to collect external pages, but the latest version isn’t important. For social media feeds like Twitter, modifications to the external page often aren’t relevant. For example, the external link could be an article or blog post with constantly changing advertisements and user comments that don’t matter for your collection.
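The lookup-and-reuse behavior of Option 2 can be sketched in Python as follows. Everything here is hypothetical: the Repository and Crawl classes, the collect_once function, and the fetch callback are illustrative names, not part of Cloud Preservation. The sketch also assumes that an address with no stored page is fetched once and then stored, which the option's name implies.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Repository:
    # Hypothetical store: maps an address to the id of the page kept for it.
    pages: dict[str, str] = field(default_factory=dict)


@dataclass
class Crawl:
    linked_pages: list[str] = field(default_factory=list)  # pages attached to this crawl
    fetched: list[str] = field(default_factory=list)       # addresses actually downloaded


def collect_once(address: str, repo: Repository, crawl: Crawl,
                 fetch: Callable[[str], str]) -> str:
    """Reuse the stored page for `address` if one exists; otherwise fetch it once."""
    page_id = repo.pages.get(address)
    if page_id is None:
        # Assumption: a never-before-seen address is collected a single time.
        page_id = fetch(address)
        repo.pages[address] = page_id
        crawl.fetched.append(address)
    # Either way, the page is linked to the currently running crawl.
    crawl.linked_pages.append(page_id)
    return page_id
```

Because the second and later crawls only link to the stored page, the repository holds one copy per address no matter how often the external page changes.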
Option 3: Collect modified versions of external links for new or modified pages
If Cloud Preservation crawls an internal page that has not changed since the last collection, it will not attempt to fetch the latest version of any external links on that page. However, if the page has changed since the last collection, or has never been collected before, Cloud Preservation will check for new versions of all external links on that page. This option is slightly less storage-efficient than Option 2, but it still uses less storage than Option 4.
Note: This is the default setting for new Cloud Preservation feeds, as we’ve found it to be the best choice for enhancing your collection with external links while keeping storage use in check.
When to use this option: There is a requirement to collect a “point in time” snapshot of both the internal pages and the external pages.
Option 4: Always collect the latest external link
Finally, this option will always attempt to fetch the latest version of the external link. Whether the link is found on a new, modified, or unmodified internal page, Cloud Preservation will crawl the external link to see if there is a new version. This option has the largest impact on storage, as external pages change frequently due to rotating advertisements or images and updated content.
When to use this option: The latest version of offsite pages must always be collected, and the chosen Cloud Preservation plan has storage capacity to spare. This option is also necessary for some advanced crawling techniques, such as using a single internal web page whose purpose is to provide an index of several external links.
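As a rough summary of the four options, the Python sketch below expresses each one as a decision about whether to fetch an external link during a crawl. The enum, the function name should_fetch, and its two boolean inputs are illustrative assumptions, not Cloud Preservation settings or APIs. For Option 2 it assumes that a link not yet in the repository is fetched once.

```python
from enum import Enum


class ExternalLinkPolicy(Enum):
    NEVER = 1                  # Option 1
    NO_MODIFIED_VERSIONS = 2   # Option 2
    NEW_OR_MODIFIED_PAGES = 3  # Option 3 (described above as the default)
    ALWAYS_LATEST = 4          # Option 4


def should_fetch(policy: ExternalLinkPolicy,
                 page_new_or_modified: bool,
                 already_collected: bool) -> bool:
    """Decide whether to fetch an external link found on an internal page.

    page_new_or_modified: the internal page is new or changed since the last collection.
    already_collected: the repository already holds a page with this address.
    """
    if policy is ExternalLinkPolicy.NEVER:
        return False  # the link is recorded, but the page is never collected
    if policy is ExternalLinkPolicy.NO_MODIFIED_VERSIONS:
        return not already_collected  # reuse the stored copy when one exists
    if policy is ExternalLinkPolicy.NEW_OR_MODIFIED_PAGES:
        return page_new_or_modified  # skip links on unchanged internal pages
    return True  # ALWAYS_LATEST: check for a new version on every crawl


# An unchanged internal page linking to an already-collected external page:
for policy in ExternalLinkPolicy:
    print(policy.name, should_fetch(policy, page_new_or_modified=False,
                                    already_collected=True))
```

In this scenario only ALWAYS_LATEST fetches the link, which is why it carries the largest storage cost of the four.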
The crawling process of Cloud Preservation can get complicated, just like the web, and we hope this sheds a bit of light on the subject of external links.