Skip to content

🎉 New Source: Google Search Console #5350

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 36 commits into from
Sep 10, 2021

Conversation

gaart
Copy link
Contributor

@gaart gaart commented Aug 11, 2021

What

Move Google Search Console from singer to native CDK

How

Rewrite the connector using Python CDK.

GSC has several notable features:

  1. Authorization. GSC implements 2 ways of authorization: authorization through a service account or personal.
  • In spec.json, we have optional authorization fields, we provide the choice for the user.
  • We implement ServiceAccountAuthenticator, which is based on Oauth2Authenticator.
  • The connector will automatically choose the first correct authorization type (if both are correct, it will choose personal authentication).
  • Getting personal credentials is not a trivial task, so we have a credentials directory where we have several scripts to make this process more user-friendly.
  1. Pagination.
  • A user can have more than 1 website to download data, so we have to download data from each one.
  • The same functionality is used in the searchType field (only for analytic streams). We have 4 searchType and we have to load data from each one.
  • These conditions are converted to a nested loop, but the CDK does not have the ability to directly implement these loops. So, we use a little trick with next_page_token - we pretend our websites and searchTypes are just different pages, and this iterative functionality is implemented in the next_page_token method.

Recommended reading order

  1. x.java
  2. y.python

Pre-merge Checklist

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • docs/SUMMARY.md
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions
  • Connector added to connector index like described here

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here

@github-actions github-actions bot added the area/connectors Connector related issues label Aug 11, 2021
@gaart gaart changed the title Gaart/5235 source google search console Source: Google Search Console Aug 11, 2021
@gaart gaart changed the title Source: Google Search Console New Source: Google Search Console Aug 11, 2021
@gaart gaart changed the title New Source: Google Search Console Source: Google Search Console Aug 11, 2021
@gaart gaart linked an issue Aug 11, 2021 that may be closed by this pull request
@@ -0,0 +1,16 @@
FROM python:3.7-slim
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should use alpine, see source-zendesk-support for an example

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tried an image from the zendesk connector, got timezone detection error during testing, rolled back to python:3.7-slim

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TZ issue should be fixed, see latest templates

@@ -0,0 +1,5 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file should be in a dir called secrets/ otherwise creds will get checked into source control by accident

@jrhizor jrhizor temporarily deployed to more-secrets August 17, 2021 06:50 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets August 17, 2021 08:49 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets August 17, 2021 09:18 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets August 17, 2021 10:09 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets August 17, 2021 13:30 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets August 18, 2021 10:11 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets August 18, 2021 10:20 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets August 18, 2021 10:28 Inactive
…-source-google-search-console

# Conflicts:
#	.github/workflows/publish-command.yml
#	.github/workflows/test-command.yml
@gaart gaart requested a review from sherifnada September 2, 2021 13:45
@@ -306,6 +306,11 @@
dockerRepository: airbyte/source-pokeapi
dockerImageTag: 0.1.1
documentationUrl: https://docs.airbyte.io/integrations/sources/pokeapi
- sourceDefinitionId: eb4c9e00-db83-4d63-a386-39cfa91012a8
name: Google Search Console (native)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaart reminder about this comment


class Sitemaps(GoogleSearchConsole):
"""
API docs: https://developers.google.com/webmaster-tools/search-console-api-original/v3/sitemaps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for including API docs links, makes it much easier to see what's going on!


def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
records = response.json().get(self.data_field) or []
for record in sorted(records, key=lambda x: x["keys"][0]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just so that I'm on the same page, so the sorting was not needed?

Copy link
Contributor

@sherifnada sherifnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! a few more points here:

  1. I think state updates are not functioning as intended due to interaction with stream slices
  2. we should remove the singer connector from the connector index
  3. need to go through the checklist to create needed docs
  4. can you add a bootstrap doc like we discussed in retro?


yield record

def get_updated_state(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't this is correct? it needs to be scoped by searchType. Currently if you sync only 1 search Type then the connector fails, the state output will just say date: September 1st which means that we will skip syncing data from previous dates for all other search types. I could be reading this wrong though?

let me know if I can help explain stream slices more clearly, it's definitely not an obvious concept at first

also can you add a unit test for testing state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sherifnada The functionality of state updates has been updated, but now we are faced with the problem of the acceptance test failing due to nesting, we cannot tell the test to get a value from a nested structure with dynamic fields.
Any suggestions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

yield record


class SearchAnalyticsAllFields(SearchAnalytics):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should put a warning in the docs that this will take a very long time to sync because it's so granular

@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Sep 7, 2021
@gaart
Copy link
Contributor Author

gaart commented Sep 7, 2021

/test connector=connectors/source-google-search-console

🕑 connectors/source-google-search-console https://github.com/airbytehq/airbyte/actions/runs/1209277304
❌ connectors/source-google-search-console https://github.com/airbytehq/airbyte/actions/runs/1209277304

@jrhizor jrhizor temporarily deployed to more-secrets September 7, 2021 11:47 Inactive
@gaart
Copy link
Contributor Author

gaart commented Sep 7, 2021

/test connector=connectors/source-google-search-console

🕑 connectors/source-google-search-console https://github.com/airbytehq/airbyte/actions/runs/1209407271
❌ connectors/source-google-search-console https://github.com/airbytehq/airbyte/actions/runs/1209407271

@jrhizor jrhizor temporarily deployed to more-secrets September 7, 2021 12:30 Inactive
@gaart
Copy link
Contributor Author

gaart commented Sep 8, 2021

/test connector=connectors/source-google-search-console

@gaart
Copy link
Contributor Author

gaart commented Sep 8, 2021

/test connector=connectors/source-google-search-console

🕑 connectors/source-google-search-console https://github.com/airbytehq/airbyte/actions/runs/1213601944
✅ connectors/source-google-search-console https://github.com/airbytehq/airbyte/actions/runs/1213601944

@jrhizor jrhizor temporarily deployed to more-secrets September 8, 2021 13:58 Inactive
@@ -29,6 +29,7 @@
| Google Ads | [![source-google-ads](https://img.shields.io/endpoint?url=https%3A%2F%2Fdnsgjos7lj2fu.cloudfront.net%2Ftests%2Fsummary%2Fsource-google-ads%2Fbadge.json)](https://dnsgjos7lj2fu.cloudfront.net/tests/summary/source-google-ads) |
| Google Adwords | [![source-google-adwords-singer](https://img.shields.io/endpoint?url=https%3A%2F%2Fdnsgjos7lj2fu.cloudfront.net%2Ftests%2Fsummary%2Fsource-google-adwords-singer%2Fbadge.json)](https://dnsgjos7lj2fu.cloudfront.net/tests/summary/source-google-adwords-singer) |
| Google Analytics | [![source-googleanalytics-singer](https://img.shields.io/endpoint?url=https%3A%2F%2Fdnsgjos7lj2fu.cloudfront.net%2Ftests%2Fsummary%2Fsource-googleanalytics-singer%2Fbadge.json)](https://dnsgjos7lj2fu.cloudfront.net/tests/summary/source-googleanalytics-singer) |
| Google Search Console | [![source-google-search-console](https://img.shields.io/endpoint?url=https%3A%2F%2Fdnsgjos7lj2fu.cloudfront.net%2Ftests%2Fsummary%2Fsource-google-search-console%2Fbadge.json)](https://dnsgjos7lj2fu.cloudfront.net/tests/summary/source-google-search-console) |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you insert it in alphabetical order?

@@ -0,0 +1,55 @@
# Google Search Console

The API docs: https://developers.google.com/webmaster-tools/search-console-api-original/v3/parameters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The API docs: https://developers.google.com/webmaster-tools/search-console-api-original/v3/parameters.
From [the docs](https://support.google.com/webmasters/answer/9128668?hl=en):
Google Search Console is a free service offered by Google that helps you monitor, maintain, and troubleshoot your site's presence in Google Search results.
Search Console offers tools and reports for the following actions:
* Confirm that Google can find and crawl your site.
* Fix indexing problems and request re-indexing of new or updated content.
* View Google Search traffic data for your site: how often your site appears in Google Search, which search queries show your site, how often searchers click through for those queries, and more.
* Receive alerts when Google encounters indexing, spam, or other issues on your site.
* Show you which sites link to your website.
* Troubleshoot issues for AMP, mobile usability, and other Search features.
The API docs: https://developers.google.com/webmaster-tools/search-console-api-original/v3/parameters.

we have to deal with a large dataset.

In order to reduce the amount of data, and to retrieve a specific dataset (for example, to get country specific data)
we can use SearchAnalyticsByCountry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
we can use SearchAnalyticsByCountry.
we can use SearchAnalyticsByCountry. So each of the SearchAnalytics streams groups data by certain dimensions like date, country, page, etc...

To chose one we use an authorization field with the `oneOf` parameter in the `spec.json` file.


## Analytics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section should be in the source code since if the code changes this can go out of date, so putting it next to the code reduces the chance it goes out of date. . The bootstrap should be more focused on higher level details of how the API works

@@ -82,6 +80,15 @@ If you lose this key pair, you will need to generate a new one!
### Note
You can return to the [API Console/Credentials](https://console.cloud.google.com/apis/credentials) at any time to view the email address, public key fingerprints, and other information, or to generate additional public/private key pairs. For more details about service account credentials in the API Console, see [Service accounts](https://cloud.google.com/iam/docs/understanding-service-accounts) in the API Console help file.

### Using the existing User Account
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to add this section to the external docs as it will be automated via oauth in the UI. This is an excellent README however to place in the credentials/ directory.


## How to create the client credentials for Google Search Console, to use with Airbyte?

You can either:
* Use the existing `Service Account` for your Google Project with granted Admin Permissions
* Use the existing `User Account` for your Google Project with granted Admin Permissions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Use the existing `User Account` for your Google Project with granted Admin Permissions
* Use your personal Google User Account with oauth. If you choose this option, your account must have permissions to view the Google Search Console project you choose.

@@ -42,7 +42,7 @@ Airbyte uses a grading system for connectors to help users understand what to ex
|[Google Analytics](./sources/googleanalytics.md)| Beta |
|[Google Analytics v4](./sources/google-analytics-v4.md)| Beta |
|[Google Directory](./sources/google-directory.md)| Certified |
|[Google Search Console](./sources/google-search-console.md)| Beta |
|[Google Search Console](./sources/google-search-console.md)| Alpha |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should still be beta since we are committed to supporting it

Suggested change
|[Google Search Console](./sources/google-search-console.md)| Alpha |
|[Google Search Console](./sources/google-search-console.md)| Beta |

current_stream_state = {site_url: {search_type: {self.cursor_field: latest_benchmark}}}

# this line is required to pass the acceptance test successfully
current_stream_state[self.cursor_field] = current_stream_state[site_url][search_type][self.cursor_field]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any way to avoid having this here i.e; adding information only to use it in tests? Is the problem that we want the max over all search types and the yaml format doesn't support that? if so let's add this comment to help the next dev understand

Suggested change
current_stream_state[self.cursor_field] = current_stream_state[site_url][search_type][self.cursor_field]
# we need to get the max date over all searchTypes but the current acctest YAML format doesn't
# support that
current_stream_state[self.cursor_field] = current_stream_state[site_url][search_type][self.cursor_field]

@gaart
Copy link
Contributor Author

gaart commented Sep 10, 2021

/test connector=connectors/source-google-search-console

🕑 connectors/source-google-search-console https://github.com/airbytehq/airbyte/actions/runs/1220120844
✅ connectors/source-google-search-console https://github.com/airbytehq/airbyte/actions/runs/1220120844

@jrhizor jrhizor temporarily deployed to more-secrets September 10, 2021 05:58 Inactive
@gaart
Copy link
Contributor Author

gaart commented Sep 10, 2021

/publish connector=connectors/source-google-search-console

🕑 connectors/source-google-search-console https://github.com/airbytehq/airbyte/actions/runs/1220195151
✅ connectors/source-google-search-console https://github.com/airbytehq/airbyte/actions/runs/1220195151

@jrhizor jrhizor temporarily deployed to more-secrets September 10, 2021 06:29 Inactive
@gaart gaart merged commit c0d4652 into master Sep 10, 2021
@gaart gaart deleted the gaart/5235-source-google-search-console branch September 10, 2021 08:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New Source: Airbyte native Google Search Console
5 participants