Summary

ORA includes a Generic Service for extracting data from web pages that are not part of a repository with a custom ORA service. Repositories with custom services are listed on the Repositories help page.

On this help page, any page where the Generic Service determines the contents of ORA's Control Panel is called a "generic page" and the site it is on is called a "generic site".

The Generic Service makes it possible to extract data

  • when a site's custom service is not yet available, or
  • when a site is not popular enough to warrant a custom service, or
  • when a site does not have genealogical data but contains data that is of interest to your research.

The Generic Service extracts data from the following locations:

  • From standard HTML elements and meta elements, such as the page title and page description
  • From extended metadata included with the page
  • From data tables on the page defined with the HTML TABLE element

Some sites load data in a separate operation after loading the initial page contents. The Generic Service will look for table data in the initial page contents. If it does not find any, it will refresh the Control Panel up to three times in the first 25 seconds. You may see the Control Panel flash as a result of those refresh steps.
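The retry behavior described above can be pictured with a small Python sketch. It is only an illustration, not ORA's implementation: the function name, the callback parameters, and the delay schedule of 5, 10, and 10 seconds are assumptions chosen so that three refreshes fit within 25 seconds.

```python
from typing import Callable, Optional, Sequence


def extract_with_retries(
    find_tables: Callable[[], list],
    sleep: Callable[[float], None],
    delays: Sequence[float] = (5.0, 10.0, 10.0),  # hypothetical schedule, 25 s total
) -> Optional[list]:
    """Look for table data once, then retry after each delay until data is found."""
    tables = find_tables()  # first look: the initial page contents
    for delay in delays:
        if tables:
            break  # data found, no further refresh needed
        sleep(delay)
        tables = find_tables()  # one refresh step
    return tables or None
```

Passing the sleep function as a parameter keeps the sketch easy to try out without waiting; in real use it could be `time.sleep`.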

If ORA finds table data, it extracts the data as described below under Table Shapes. If ORA extracts data from a table, and that table subsequently changes as a result of pagination or some other operation by a script on the page, ORA will refresh the Control Panel.

As with other services, you may enable or disable ORA's Generic Service. When the Generic Service is disabled, the ORA Control Panel will not appear for pages on any generic sites. Each custom service has its own enabled/disabled setting.

The Generic Service can also be enabled or disabled for certain sites as explained in the next section.

Control Panel

When you visit any generic site, the ORA Control Panel will include two buttons, Add and Block.

Added or Blocked

Add and Block Buttons on the ORA Control Panel

That panel appeared on a Death Certificate page on the Provincial Archives of New Brunswick site:

Add and Block Buttons on Provincial Archives of New Brunswick Site

As you can see, ORA has added the hostname to the button labels.

You may ignore the Control Panel. If you do, the same version of the Control Panel will continue to appear on all the pages on the current generic site.

You may click the Add button. If you do, ORA will add the current hostname as a collection under the Generic Service. ORA will reprocess the page, extract the data it finds, and update the ORA Control Panel. For the example page, the Control Panel will appear as follows:

Death Certificate Page on Provincial Archives of New Brunswick

You may click the Block button. If you do, ORA will add the hostname to its blocked hostnames list and the Control Panel will be removed. The Generic Service will not add a Control Panel to any generic site on the blocked hostnames list.

You should use the Block button when you do not extract data from the site as part of your research. So, for example, you should block ORA on shopping sites, sites for other hobbies, etc.

ORA initializes your block list to include:

  • several search engines including Google, Bing, and DuckDuckGo,
  • several eCommerce sites including Amazon and eBay, and
  • several social media sites including Facebook and Instagram.

You may edit the list to reflect your preferences, and as described above, you may use the ORA Control Panel to add more entries.

Collections

When you use the Add button in the Control Panel, ORA adds the current hostname as a collection under the Generic Service. Any Text Template or Auto Type Template that you add for that collection will apply to any pages with the same hostname.

This is different from custom services where ORA adds collections to the service that correspond to collections provided by the repository. Unfortunately, it takes repository-specific logic to determine the repository collection associated with a page, and that logic is not available with the Generic Service because it applies to multiple repositories and other sites.

One effect of having one collection per hostname is that any Text Template or Auto Type Template that you add will apply to all pages that share the same hostname.

Another effect of using hostnames as collections is that it may be more difficult for you to associate the collection name (the hostname) with the name of the site. "Provincial Archives of New Brunswick" is easier to understand than "archives.gnb.ca".

Hostnames

ORA determines the hostname of a generic site from the host portion of the URL of the current page. The host portion of the URL includes an optional prefix and the domain name. In www.example.com, the prefix is "www." and the domain name is "example.com". The URL often includes a path after the domain name. If a path is present, it begins with the slash ("/") character immediately after the domain name.

When the host portion of a URL begins with www., ORA will remove the www. prefix and use the rest of the host portion as the hostname. So, for www.example.com, the ORA hostname is example.com.

For other hostnames, such as archives.gnb.ca from the example above, ORA will use the full value of the host portion of the URL.

Many sites use multiple hostnames. Often, the hostnames vary by country, such as "www.ebay.com" and "www.ebay.co.uk", but the content may or may not be the same. For generic sites, ORA does not attempt to combine such hostnames into a single collection. If you visit a site and the ORA Control Panel doesn't operate the way you expect, it may be because the hostname is different from prior sites you have visited from the same organization.
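The hostname rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ORA's code: the function name ora_hostname is hypothetical, and Python's urllib.parse stands in for whatever URL parsing ORA uses.

```python
from urllib.parse import urlsplit


def ora_hostname(url: str) -> str:
    """Return the host portion of a URL, without a leading 'www.' prefix."""
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]  # drop only the "www." prefix
    return host


# Country-specific hostnames stay separate, so each would be its own collection.
print(ora_hostname("https://www.ebay.com/"))
print(ora_hostname("https://www.ebay.co.uk/"))
print(ora_hostname("https://archives.gnb.ca/Search/RS551A/Details.aspx?Key=3492"))
```

Only the "www." prefix is removed; any other prefix, such as "archives.", is kept as part of the hostname.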

URLs in Templates

URLs often include components that identify the collection assigned by the repository. If you see a collection ID or some other indicator of the record type of the current page in a URL, you may want to add an outer conditional expression to the template so it only produces output when that collection ID is present in the URL of the page.

For example, the Provincial Archives of New Brunswick site includes an obvious collection ID in some of its URLs. On its Marriage Bond detail pages, for example, the URL includes "RS551A":

https://archives.gnb.ca/Search/RS551A/Details.aspx?Key=3492

If you want a template to produce output only when "RS551A" is part of the URL, you could use this approach:


<[?:Page.URL/RS551A/]

... Put your template text here ...
>

The conditional expression begins on the first line and contains all the other text in the template because the ending ">" is the last character in the template. "[?:...]" is a Value Test that uses a Regular Expression to search for "RS551A" in the Page.URL value. If the value test is false, the rest of the template is ignored and the template will produce no text.

If a Text Template produces no text, the template will not add any field to the Control Panel.

If you click a button for an Auto Type Template, and the template produces no text, ORA will issue a warning message, "The Auto Type template did not produce any non-blank text."
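The effect of wrapping a whole template in a URL test can be pictured with a short Python sketch. It is not how ORA evaluates templates: the function name is hypothetical, and Python's re module stands in for ORA's regular expression engine, whose exact flavor and case-sensitivity rules are not described on this page.

```python
import re


def render_if_url_matches(page_url: str, pattern: str, body: str) -> str:
    """Return the template body only when the pattern is found in the page URL."""
    if re.search(pattern, page_url):
        return body
    return ""  # no text, so no field (or a warning for an Auto Type Template)


marriage_bond = "https://archives.gnb.ca/Search/RS551A/Details.aspx?Key=3492"
print(repr(render_if_url_matches(marriage_bond, "RS551A", "template text")))
```

A page on the same hostname whose URL does not contain "RS551A" would yield an empty string, which corresponds to the "template produces no text" cases described above.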

Table Shapes

ORA searches for tables on generic pages. ORA will attempt to extract data from any table that matches one of these shapes:

  • Field per Row
  • Multiple Data Rows
  • Up and Down

ORA's Generic Service works best when the generic page follows the HTML standard for the TABLE element and its child elements. Unfortunately, many generic pages misuse HTML TABLEs. ORA does its best to handle those pages, but incorrect TABLE usage may prevent ORA from recognizing some data tables or may negatively influence the Control Panel contents.

There are special considerations for pages that include two or more data tables. See the Multiple Data Tables section.

Field per Row

In a "field per row" table, there are two columns. The first column is the field name, and the second column is the field value.

ORA adds a radio button to the first data cell in the table. Click the radio button to update the ORA Control Panel to show the contents of the table.

Example

Name  | Jane Jones
Date  | 25 AUG 1946
Place | New York, NY

Fields:

  • Name
  • Date
  • Place
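As a rough illustration of the "field per row" shape, the Python sketch below turns two-cell rows into field names and values. It assumes the table has already been read into lists of cell text; the function name is hypothetical and this is not ORA's implementation.

```python
from typing import Dict, List


def fields_from_field_per_row(rows: List[List[str]]) -> Dict[str, str]:
    """Treat the first cell of each two-cell row as the name, the second as the value."""
    fields: Dict[str, str] = {}
    for row in rows:
        if len(row) != 2:
            continue  # this sketch ignores rows that do not have exactly two cells
        name, value = row
        fields[name.strip()] = value.strip()
    return fields


table = [
    ["Name", "Jane Jones"],
    ["Date", "25 AUG 1946"],
    ["Place", "New York, NY"],
]
print(fields_from_field_per_row(table))
```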

Multiple Data Rows

A "multiple data rows" table has one or more heading rows followed by one or more data rows. Each data row creates a set of field values.

ORA adds a radio button to the first data cell in each row. You can click any radio button to select a row. The Control Panel shows the field values from the selected row.

Example 1

Date        | Bride              | Groom
25 AUG 1946 | Jane Jones         | James Smith
3 APR 1958  | Jane (Jones) Smith | Roger Sheppard

Fields:

  • Date
  • Bride
  • Groom

Example 2

If the generic site uses the correct HTML, ORA can process complex heading rows.

Date        | Bride                 | Groom
            | Given | Surname       | Given | Surname
25 AUG 1946 | Jane  | Jones         | James | Smith
3 APR 1958  | Jane  | (Jones) Smith | Roger | Sheppard

Fields:

  • Date
  • Bride Given
  • Bride Surname
  • Groom Given
  • Groom Surname
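One way to picture how stacked heading rows become field names is the Python sketch below. It assumes each heading cell is described by its label plus the HTML colspan and rowspan values, and that "Date" spans both heading rows. The function name and the input format are hypothetical, and ORA's actual logic may differ.

```python
from typing import Dict, List, Tuple

HeadingCell = Tuple[str, int, int]  # (label, colspan, rowspan)


def flatten_headings(heading_rows: List[List[HeadingCell]]) -> List[str]:
    """Combine stacked heading cells into one field name per column."""
    n_rows = len(heading_rows)
    grid: List[Dict[int, str]] = [dict() for _ in range(n_rows)]
    for r, row in enumerate(heading_rows):
        col = 0
        for label, colspan, rowspan in row:
            while col in grid[r]:  # skip columns filled by a rowspan from above
                col += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    if r + dr < n_rows:
                        grid[r + dr][col + dc] = label
            col += colspan
    n_cols = max((c for g in grid for c in g), default=-1) + 1
    names = []
    for c in range(n_cols):
        parts: List[str] = []
        for r in range(n_rows):
            label = grid[r].get(c, "")
            if label and (not parts or parts[-1] != label):
                parts.append(label)  # avoid "Date Date" for a rowspan cell
        names.append(" ".join(parts))
    return names


headings = [
    [("Date", 1, 2), ("Bride", 2, 1), ("Groom", 2, 1)],
    [("Given", 1, 1), ("Surname", 1, 1), ("Given", 1, 1), ("Surname", 1, 1)],
]
print(flatten_headings(headings))
```

This kind of combination depends on the page declaring the spans correctly, which is why the example is conditional on the site using the correct HTML.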

Example 3

If the heading row uses the same label more than once, ORA will add a sequence number.

Date        | Given | Surname | Given | Surname
25 AUG 1946 | Jane  | Jones   | James | Smith

Fields:

  • Date
  • Given 1
  • Surname 1
  • Given 2
  • Surname 2
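The numbering rule in Example 3 can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that only labels appearing more than once receive a number, as the field list above suggests.

```python
from collections import Counter
from typing import List


def number_duplicates(labels: List[str]) -> List[str]:
    """Append a sequence number to every label that occurs more than once."""
    totals = Counter(labels)
    seen: Counter = Counter()
    numbered = []
    for label in labels:
        if totals[label] > 1:
            seen[label] += 1
            numbered.append(f"{label} {seen[label]}")
        else:
            numbered.append(label)  # unique labels are left unchanged
    return numbered


print(number_duplicates(["Date", "Given", "Surname", "Given", "Surname"]))
```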

Up and Down

An "up and down" table has pairs of rows where the first row in a pair contains headings and the second row in the pair contains values.

ORA adds a radio button to the first data cell in the table. Click the radio button to update the ORA Control Panel to show the contents of the table.

Example

Date          | Bride         | Groom
25 AUG 1946   | Jane Jones    | James Smith
Place         | Minister
Watertown, NY | Curtis Martin

Fields:

  • Date
  • Bride
  • Groom
  • Place
  • Minister
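The pairing of heading rows and value rows in an "up and down" table can be sketched in Python like this. The sketch assumes the rows are already available as lists of cell text; the function name is hypothetical and this is not ORA's implementation.

```python
from typing import Dict, List


def fields_from_up_and_down(rows: List[List[str]]) -> Dict[str, str]:
    """Pair each heading row with the value row that follows it."""
    fields: Dict[str, str] = {}
    for i in range(0, len(rows) - 1, 2):
        headings, values = rows[i], rows[i + 1]
        for heading, value in zip(headings, values):
            fields[heading] = value
    return fields


table = [
    ["Date", "Bride", "Groom"],
    ["25 AUG 1946", "Jane Jones", "James Smith"],
    ["Place", "Minister"],
    ["Watertown, NY", "Curtis Martin"],
]
print(fields_from_up_and_down(table))
```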

Multiple Data Tables

There are some special considerations for generic pages that include two or more data tables.

As described above in the Table Shapes section, the ORA Control Panel includes data from the table with the selected radio button. For the Field per Row and Up and Down table shapes, there is one radio button per table. For the Multiple Data Rows table shape, there is one radio button per data row.

For the Field per Row and Up and Down table shapes, even when the radio button for the table is not selected, ORA will process the fields in the table:

  • The fields will be available to use in templates.
  • The fields will not be visible in the Control Panel.
  • The fields will not be copied if you click one of the clipboard copy buttons and you do not press the ALT key.
  • The fields will be copied if you click one of the clipboard copy buttons and you do press the ALT key.
  • The field names for the hidden fields will start with the prefix T followed by the sequential number of the data table relative to the other data tables on the page and a period, such as "T1." for the first data table.

The rules above are useful when a generic page has two tables that are related, such as when one table has fields that apply to multiple rows in another table.

Example

In the example page below from the FreeCEN site, the first table includes fields that apply to all the rows in the second table. The selected radio button is in the first row of the second table, so those fields are shown in the Control Panel.

Two Data Tables on the Same Generic Page

If you copy the fields to a spreadsheet by holding ALT and then clicking the copy-vertical button near the top of the ORA Control Panel, you will see this:

Page.Title               | FreeCEN - UK Census Records (England, Scotland, Wales)
Page.Access Date         | 03 November 2020
Collection ID            | freecen.org.uk
T1.Census Year           | 1871
T1.County                | Somerset (SOM)
T1.Place                 | Oldland
T1.Civil Parish          | Mangotsfield
T1.Ecclesiastical Parish | Mangotsfield
T1.Piece                 | 2500
T1.Enumeration District  | 2
T1.Folio                 | 40
T1.Page                  | 29
T1.Schedule              | 161
T1.House or Street Name  | Soundwell
Surname                  | SHEPPARD
Forenames                | Giles
Relationship             | Head
Marital Status           | M
Sex                      | M
Age                      | 42
Occupation               | Coal Miner
Birth County             | GLS
Birth Place              | Mangotsfield
Source.Title             | FreeCEN - UK Census Records (England, Scotland, Wales)

Note that fields from the first table are included with field names preceded by "T1.".
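The naming pattern in this example can be sketched in Python as follows. The sketch assumes each data table has already been reduced to a set of fields, in page order, and that fields from every table other than the selected one receive the "T" prefix. The function name is hypothetical and this is not ORA's implementation.

```python
from typing import Dict, List


def combine_tables(tables: List[Dict[str, str]], selected: int) -> Dict[str, str]:
    """Merge fields from all tables, prefixing the tables that are not selected.

    tables   -- one dict of fields per data table, in page order
    selected -- 0-based index of the table that holds the selected radio button
    """
    combined: Dict[str, str] = {}
    for number, fields in enumerate(tables, start=1):
        for name, value in fields.items():
            key = name if number - 1 == selected else f"T{number}.{name}"
            combined[key] = value
    return combined


page_tables = [
    {"Census Year": "1871", "Place": "Oldland"},        # first table on the page
    {"Surname": "SHEPPARD", "Forenames": "Giles"},      # selected row of second table
]
for field, value in combine_tables(page_tables, selected=1).items():
    print(field, "|", value)
```

A template could then refer to a prefixed name such as T1.Census Year alongside the fields from the selected row.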

Limitations

The Generic Service will do a great job on some pages and a not-so-great job on others. There are several problem areas.

  1. Some pages use HTML tables for content that is not tabular data.

    ORA attempts to ignore tables that are not filled with tabular data, but that is not always possible. As a result, you may see ORA's radio button added to content on the page, and the ORA Control Panel filled with fields that have unusual headings and values. This should not pose a problem other than momentary confusion over the Control Panel contents. If there is an actual table on the page, click a radio button in that table. Otherwise, ignore the "table that isn't a table".

  2. Some sites do not use semantically-correct HTML tables.

    Many sites implement tables incorrectly.

    • A common issue is that sites use the TD (table data) element for heading values. The proper element for a heading is TH (table header).
    • Many sites put heading rows inside the TBODY element even though the main heading rows should be inside the THEAD element.

    ORA will attempt to process tables that are not semantically correct, and the results will often be quite good. However, some poorly-formed tables will cause incorrect results in ORA.

  3. Some sites fetch data after the initial page is loaded.

    If you see a table, but the ORA Control Panel does not contain the fields you expect to see, try clicking the Refresh icon in the ORA Control Panel. That will force ORA to reexamine the page. If table data has been added after the last time ORA processed the page, refreshing the Control Panel will solve the issue.

  4. Some sites do not use tables to present data.

    Some sites use lists where it is not clear from the HTML structure what is a label and what is a value.

    Some sites use HTML elements that are styled to look like a table, but aren't actually tables.

    ORA's Generic Service only processes data stored in HTML tables.

  5. Some sites have scripts that interfere with ORA.

    On some sites, ORA may add a Control Panel and populate it with the first set of data from a table. When you attempt to select a different row, nothing will change. The likely cause is a script used by the current site that prevents ORA from reacting to the selection change.

Generic Sites

Visit the Generic Sites page to see a list of generic sites that have been assessed for use with ORA's Generic Service.
