Week 42 in review

In a week where Facebook introduced the world to the metaverse, HARNESS have been exploring cosmos, wrestling with certs and welcoming a new galactic explorer to the team.

What's happened this week?

PDFx

Extended bulk comparison

As part of our ongoing algorithm training, we have continued to test the improvements on a number of documents from the HARNESS PDF library. We are currently compiling an article on our rigorous testing process for algorithm changes but, essentially, each update to PDFx is run against a subset of 1,000 PDFs for which the benchmark data holds the 'true' values. This week, we decided to expand the test and trial PDFx against all 75,000 PDFs in the library. Whilst a few of them may be EPC certificates, land registry titles etc., it's essential that PDFx retains its ability to correctly identify document types and return the correct datapoints for each document.

As you may appreciate, reviewing the output of this huge bulk run is no small task. We will look to start the task of reviewing the output early next week.
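
As a rough illustration of what scoring a run against benchmark data can look like, here is a minimal Python sketch. It is not the PDFx test harness: the function name `score_run`, the dictionary layout and the sample values are all assumptions made for the example.

```python
from collections import Counter


def score_run(extracted, benchmark):
    """Return per-field accuracy of extracted values against benchmark 'true' values.

    Both arguments are assumed to map a document id to a {field: value} dict.
    """
    correct = Counter()
    total = Counter()
    for doc_id, truth in benchmark.items():
        got = extracted.get(doc_id, {})  # a missing document scores zero on every field
        for field, true_value in truth.items():
            total[field] += 1
            if got.get(field) == true_value:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


# Hypothetical benchmark and extraction output for two documents.
benchmark = {
    "doc-001": {"document_type": "Lease", "tenant": "Acme Ltd"},
    "doc-002": {"document_type": "EPC", "rating": "C"},
}
extracted = {
    "doc-001": {"document_type": "Lease", "tenant": "ACME Limited"},
    "doc-002": {"document_type": "EPC", "rating": "C"},
}

accuracy = score_run(extracted, benchmark)
print(accuracy)
```

A per-field breakdown like this makes it easier to see whether a change helped document type detection but hurt a particular datapoint, which matters more as the run grows from 1,000 to 75,000 documents.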

Wrestling the certificates

The development team had some ‘fun’ these past weeks setting up their local environments to serve our web apps over HTTPS. There are a few reasons to bump your local server to HTTPS. In our case, we have two projects that share a cookie, and Google Chrome has tightened up its same-site cookie requirements – Chrome now considers cross-scheme requests to be for different sites, so since one service ran on HTTPS we had to run the other one on HTTPS, too.
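
To make the cross-scheme rule concrete, here is a simplified Python sketch of a "schemeful" same-site check. It is an approximation, not Chrome's implementation: it assumes the registrable domain is the last two labels of the hostname, whereas browsers consult the Public Suffix List, and the `example.test` hostnames are placeholders.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Return (scheme, registrable_domain) for a URL.

    Simplification: the last two hostname labels stand in for the
    registrable domain; real browsers use the Public Suffix List.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    registrable = ".".join(host.split(".")[-2:])
    return parts.scheme, registrable


def is_same_site(url_a, url_b, schemeful=True):
    """Compare two URLs; with schemeful=True the scheme must match as well."""
    scheme_a, domain_a = site_of(url_a)
    scheme_b, domain_b = site_of(url_b)
    if domain_a != domain_b:
        return False
    return scheme_a == scheme_b if schemeful else True


# One service on HTTPS, the other still on HTTP (placeholder hostnames).
app = "https://app.example.test:8080"
api = "http://api.example.test:5000"

print(is_same_site(app, api, schemeful=False))  # older behaviour: scheme ignored
print(is_same_site(app, api))                   # schemeful behaviour: scheme compared
```

Under the schemeful rule, the two services above count as different sites until both are served over HTTPS, which is why a shared cookie stops working.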

This involved a number of steps, and in the end we resolved to schedule in a change to our identity process, specifically using JWT bearer tokens in the future. We already set up some other projects that way, and it’s very straightforward.

For this project, however, we’ve had to jump through the usual hoops: create a local Certificate Authority; create a cert for a local site; direct that domain to the local address; update Vue to expose the certificate and serve HTTPS; and, of course, get the other project to serve an identical/matching certificate for that same domain, too!

Between .NET and Vue, the methods differ quite a bit: .NET wants .pfx files and needs some middleware config in the ASP.NET Core pipeline, and of course all of this requires you to install a root CA, which is different on Macs and Windows machines…
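
For readers who have not done this before, the "serve HTTPS with a locally issued cert" step boils down to loading a certificate and key into the server's TLS configuration. The sketch below shows the idea with Python's standard library only; it is not our Vue or .NET setup, and the function name, file names and port are hypothetical.

```python
import http.server
import ssl


def make_https_server(cert_file, key_file, host="127.0.0.1", port=8443):
    """Build a local HTTPS server from a PEM certificate and private key.

    The certificate is assumed to have been issued by a local CA that the
    machine already trusts; creating that CA is outside this sketch.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Raises FileNotFoundError if either file is missing.
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)

    server = http.server.HTTPServer((host, port), http.server.SimpleHTTPRequestHandler)
    server.socket = context.wrap_socket(server.socket, server_side=True)
    return server


# Usage (hypothetical file names):
#   server = make_https_server("local-site.pem", "local-site-key.pem")
#   server.serve_forever()
```

Both projects need to present a certificate the browser trusts for the same domain, so whichever stack you use, each server ends up with an equivalent of the `load_cert_chain` step.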

Exploring the cosmos

This week we’ve been experimenting with Azure’s Cosmos DB as a reporting repository for the data we extract from PDFs. The size and varied nature of the data mean a relational database such as SQL Server doesn’t give us the flexibility we need as we expand into new sectors and work with new document types.

Although PDFx is built on JSON files and document storage, Cosmos DB's SQL API allows us to drill into the data we’ve extracted and provide our clients with greater insight into their data than before. To get started, we manually imported the extracted data from several docs. We’ve got to admit, although it wasn’t as straightforward as we had thought, we do like the syntax of the SQL API, and after a couple of hours we'd produced a couple of queries for some real-life requests that have come in from clients recently.

The following query returns the tenancy counts for each document in the database:

SELECT c.id, ARRAY_LENGTH(myfields["value"]["fields"][0]) AS noOfTenancies
FROM c
JOIN (
    SELECT VALUE d
    FROM d IN c.extraction.document_parts[0].data_points.fields[0]
    WHERE d.key = "TenancySchedule" AND d["value"]["case"] = "DataPointDictionaryArray"
) AS myfields

and the response looks like this: [screenshot of the query response]
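
For readers without a Cosmos DB instance to hand, the logic of the query can be emulated in a few lines of Python. This is only a sketch: the sample document is invented, its shape is inferred from the paths used in the query, and the `Landlord` entry with its `DataPointString` case is a hypothetical stand-in for any non-matching field.

```python
def tenancy_counts(documents):
    """Emulate the query: one row per matching TenancySchedule field.

    Documents without a matching field produce no row, mirroring the
    way a JOIN on an empty subquery drops the document.
    """
    rows = []
    for c in documents:
        fields = c["extraction"]["document_parts"][0]["data_points"]["fields"][0]
        for d in fields:
            value = d.get("value", {})
            if d.get("key") == "TenancySchedule" and value.get("case") == "DataPointDictionaryArray":
                rows.append({"id": c["id"], "noOfTenancies": len(value["fields"][0])})
    return rows


# Invented sample document; only the nesting follows the query's paths.
sample = [
    {
        "id": "doc-001",
        "extraction": {
            "document_parts": [
                {
                    "data_points": {
                        "fields": [
                            [
                                {
                                    "key": "TenancySchedule",
                                    "value": {
                                        "case": "DataPointDictionaryArray",
                                        "fields": [[{"tenant": "A"}, {"tenant": "B"}]],
                                    },
                                },
                                {
                                    "key": "Landlord",
                                    "value": {"case": "DataPointString", "fields": ["Example Ltd"]},
                                },
                            ]
                        ]
                    }
                }
            ]
        },
    }
]

rows = tenancy_counts(sample)
print(rows)
```

Each row carries the document `id` and a `noOfTenancies` count, matching the two columns selected by the query.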

ADDRESSABLE

Introduction of partial address matches

The ADDRESSABLE data scientists have introduced the concept of partially matched addresses, which are linked via Building Polygons or TOIDs. Whilst these are only available in our internal instance of ADDRESSABLE, we hope that this will allow us to continue expanding and connecting data that otherwise would have been lost by most address matching / data analytics tools.

The above screenshot shows a partial record found via the Experian Shop Point data set.

By matching it via the building polygon, we can start to see more data relating to that entry.

In other news

HARNESS welcome Andrew Smith as CRO

This week, we welcomed Andrew Smith to the team. Andrew joins us as Chief Revenue Officer. Although Andrew only joined us on Wednesday, he has already had face-to-face(!!) meetings with the business development team and a number of our clients.

A full announcement will be made soon... but for now, here are 5 facts about Andrew!

Introducing Andrew - Chief Revenue Officer

What's coming up

Articles

Next week, we'll introduce our first article on data extraction. This article will be a short introduction to data extraction as a whole, covering the benefits of automated data extraction for businesses of all sizes. In the near future, we plan to expand the series to cover our internal quality assurance processes and beyond.

PPSM

The latest PPSM release has now been finalised and will be published mid-next week. If you haven't already, please ensure that you have signed up to receive the latest release, straight to your inbox!
